Preface


Texts in measurements reflect the trends in that field. Until recently, the 
problem of the instructor in psychological testing has been to acquaint his 
students with the instruments available, and textbooks have been essentially 
descriptive catalogs of tests. The years have seen such a multiplication of 
good and necessary devices that it is now hopeless to try to introduce the 
student to all the significant tests. Moreover, it is clear that more and more 
tests will be developed. In clinical psychology especially, there is a tendency 
to demand a great variety of tests rather than to seek a few best tests for 
each function. The clinician prefers to draw, from a huge armamentarium,
the particular tools which best fit each individual client. In educational and 
industrial practice as well, we are learning that the test which best solves 
one problem may be of little value in other situations. 

If we cannot introduce the student to all the tests he will use, the course 
in testing must have a different function. The obvious need is for a course
which will present the basic principles of testing in such a way that the 
student will learn to choose tests wisely for particular needs, and so that he 
will be aware of the weaknesses of whatever tests he uses. Thus he will be 
enabled to cope with new tests as they emerge. It is well that experience and 
research on tests have unearthed a large vein of significant general princi- 
ples. 

While current textbooks should stress these general principles, specific 
tests need not be lost sight of. The student must surely know the tests most 
widely used and most widely cited in the literature, and a critical study of 
them is the best way to understand the principles. The student who
understands how the Multiphasic differs from the Bernreuter is in a position to
make sound decisions about any present paper-pencil test of personality.
The student who can compare completely the Wechsler test with the
Primary Mental Abilities battery applies every principle we know about
intelligence testing. In choosing tests for study, emphasis has been placed on (1)
tests having wide application, (2) tests which clearly illustrate significant
principles either in the breach or the observance, and (3) tests which are
prototypes of important techniques. Necessarily, many tests have been omit- 
ted, including some which are greatly respected. The space given to a test 
in this book is not in proportion to the goodness of the test. So that the 
student will have ready access to a wider listing of tests than we can discuss, 
a summary listing of tests accompanies many chapters. Even this listing has 
been selective. So that the student will have some idea whether a test merits 
further consideration for some purpose he has in mind, a column of critical
comments has been added to each table. These comments are chosen from
competent reviews of the test. No matter how careful one is in selecting
such quotations, they cannot represent fully the thought of the reviewers. In
many cases, it has been necessary to cite one out of several reviews of a test. 
Users of the book are therefore cautioned against treating these comments as 
final judgments; reference to the original source or to other similar reviews 
is required to obtain a full and balanced report on the test. 

Testing has been furthered by two lines of emphasis in American psychol- 
ogy: one, the practical and clinical application of instruments; the other, the 
theoretical and statistical analysis of testing problems. So far have these two 
specialties drifted apart that the person trained in psychometric theory can- 
not accept some important instruments of the clinician, because clinicians 
often have faith in devices which do not satisfy the psychometric criteria. 
Yet if these devices were rejected, the clinician would be deprived of truly 
helpful tools. The other face of the picture is that the clinician is often dis- 
satisfied with the precisely designed and statistically purified tests that come 
from the psychometric school, because they do not give the information he 
needs about his clients. Future research and courses in psychology must 
lead to a mutual understanding between these two equally necessary groups 
of specialists. To be welcomed, then, is any training which provides both 
with common basic concepts. On such a common course in principles of 
psychological testing, specialized study of industrial, clinical, or educational 
tests, and statistics, theory of measurements, and so on, may be founded. The 
clinician must know how the psychometrically trained specialist evaluates 
tests, but the test technician must see why tests he rates as poor do function, 
in practical conditions, better than the others he gives a clean bill of health. 
The obvious goal is to develop tests which merit the full approval of both 
groups, as few of our present devices do. 

Every writer on testing faces the problem: How technical shall the
presentation be? It is impossible to say some things that need to be said without
resorting to statistics, and that means that the student must be able to grasp
those statistics intelligently; hence a chapter early in the book on simple
statistical methods, sufficient to give a little meaning to statements about
correlation. In discussing reliability, and, even more important, validity, 
one can choose between superficiality and technicality. Even with ideal 
writing, topics such as these require energetic thinking. But a student of 
psychology is not equipped to read test manuals and research reports crit- 
ically until he knows the principles included in this text. Some students will 
find parts of Chapters 4, 9, and 11 difficult and should be advised to
postpone intensive study of them. Most students will find it valuable to return to
Chapter 4 for review after covering the later chapters. 

Factor analysis is a topic especially likely to seem difficult. Factor analysis
is admittedly technical, but it is also so fundamental in the present and
future of psychological testing that the student cannot dodge it. The emphasis
on factor analysis provided in this text is the minimum that the student can
know if he is to understand the new tests that are on the horizon.

One further explanation is required. The questions which stud the book 
are part of the text. They capitalize on our knowledge that the brain which
works as it reads profits most. By thinking through the questions on each
section, the reader sees how the principles apply and becomes aware of
topics which need further thought. The questions do not always have
specific answers. Frequently they are controversial, or can only be answered by
a qualified “Yes, but ...” The student who sees two sides to any of the
questions can have confidence that he probably is doing good thinking.

Having these purposes in mind, the writer has done what he could to ac- 
complish them. The work has been aided and improved by the careful criti- 
cisms of many friends. The writer is especially grateful to Irving Lorge, 
Walter L. Wilkins, Dorothy Knoell, and David Krathwohl for their careful 
reading and criticism of large sections of the manuscript. Many errors in 
particular chapters have been removed as a result of the helpful comments 
of Lewis M. Terman, K. J. Holzinger, and Quinn McNemar. The largest 
number of suggestions, and some of the most helpful, have come from the 
students on whom the materials were tried. Elizabeth Johnson, who took the
major responsibility for clerical work on the manuscript, has been of great 
assistance. 

As indicated within the text, several investigators have been kind enough 
to make unpublished materials available. The writer owes an especial debt 
to O. K. Buros, who went to great trouble to provide galley proofs of The
Third Mental Measurements Yearbook so that up-to-date critical comments
could be used. Thanks are also extended to the numerous publishers who 
gave permission to quote various materials. 


Lee J. Cronbach 

Urbana, Illinois 
June, 1949 







CHAPTER 1 



Who Uses Tests? 


“In the field of tests and measurements, we really have accumulated a vast amount 
of material — and I should say, reliable and valid material — on how to distinguish 
human ability. We know, for example, that we may find high mental ability at 
every socio-economic level. We know too (and we can demonstrate this with good
statistical material) that children differ widely not only from one another but from
their parents. We know that if we conduct a search for the kind of scientific and
humane scholarly ability which we seek in high-school and college students, for 
example, we discover that Jefferson’s ideal is sound — we will find such ability in 
every level of population, and one cannot tell, either, simply by looking at the par- 
ents or the success of the parents. 

“Then, of course, everybody knows about what was done by social scientists in 
the two world wars. I do not see how we could have found good airplane pilots 
without our new and practical knowledge of vocational testing. I do not see how
we can carry on good business and industrial enterprises from this time forward 
without using in guidance and counseling our testing knowledge. 

“Again let us notice the practical social outcome. Let us take one very recent
example. [The President’s] Committee on Higher Education has just reported that,
on the basis of adequate testing given really to millions of GI’s, we probably could
have twice as many men and women in college without diluting the mental ability
at all. They just are not there now, not because of mental deficiency, but probably
because of economic stringency” (2).*

That statement, delineating a few of the accomplishments of psychological 
testing and their significance for American life, was made by G. D. Stoddard. 
The comment was made in a radio discussion, in answer to a request from 
President Conant of Harvard to name any outstanding achievement of the 
social sciences in recent decades. The testing movement stands as a prime 
example of social science in action, since it touches on vital questions in all 
phases of our life. What is character, and what sorts of children have good 
character? What personality make-up promises that an adolescent will be a 
stable, effective adult? How can we tell which six-year-olds are ready to be- 
gin learning to read? Is this young man a good prospect for training in 
watchmaking, or should he go into a different vocation — say, steamfitting or 
patternmaking? Such are the problems to which testing, and research on
individual differences, direct their efforts. In this book, we will survey the
methods which have been and are being developed to solve these problems. 

One way to get a quick overview of the region we are to explore is to 
find out what testers do. By meeting a few of the people who work with tests, 

* Boldface numbers in parentheses refer to references at the end of the chapter.
In contrast to this clinical approach, let’s take a look at the work of a
personnel manager for a department store. This is a store with about 350
employees, ranging from roustabouts to buyers and office personnel. Edward
Blake, the personnel manager, is a heavy-set, graying man of forty-five, who 
seems interested in whatever we have to say to him. But there is also a brisk- 
ness, a sticking-to-a-schedule. “The routines of the job? I don’t do much test- 
ing myself; but I do interview everybody we hire. That helps the store, be- 
cause every employee knows there’s someone here in the office who has met 
him, and to whom he can take his problems. 

“When an applicant comes in, we have him fill in a blank, and my assist- 
ant, Miss Field, gives each one a set of tests. The tests aren’t quite the same 
for everybody. We give them all the Adaptability Test (a short intelligence 
test), since different jobs in the store call for employees of different caliber. 
Most applicants get a test of simple arithmetic — addition, percentages, dis- 
counts, and so on. For package wrappers and merchandise handlers, we use 
a simple test of motor ability, in which they have to place wooden cubes in 
a box as rapidly as possible. It doesn’t predict who’ll be the best employee, 
but it saves us from some lemons. For a few departments, we have trade 
tests, tests of information about the job. Some men claim to be shoe salesmen 
when they don’t know a last from a counter. These tests check on the expe- 
rience the applicant claims in his application blank. 

“Whatever tests Miss Field gives are scored and recorded on the applica- 
tion blank. Then, when there is a vacancy, we pull out of the file the names 
of people who have the qualifications that job requires. I call in one or
several of these people, interview them, and if I think they’ll do, I hire them.
The tests in our work are most useful to sort out the good and the poor pros- 
pects. Miss Field can give the tests very easily, and it saves us a lot of time 
we’d spend interviewing people who wouldn’t be good workers. Of course, 
Miss Field does a nice job, making sure each person thinks we’re interested 
in him, and sending each one away with a feeling that he’s had a fair con- 
sideration.” 

Mr. Blake, of course, is a little different from this or that other personnel 
manager we might have talked to. But his work is fairly typical of that in 
businesses having substantial turnover. Perhaps the best contrast to him is
Sam Garrett. 

Sam is engaged in a research program which will eventually lead to a 
method for selecting airline pilots. Garrett’s research agency is the national 
headquarters for investigations of airline personnel needs. Garrett, about 32, 
was hired as an assistant in this research just after he completed work for
his doctor’s degree in psychology. Before receiving his degree, he spent a 
large part of the war in the Air Force psychological program, developing 
tests for pilots. Just now he is setting up testing centers in various cities 
where prospective pilots can be given the experimental tests. 

“Our tests are not being used for hiring yet,” Garrett explains. “We work
only with men who are going to be tried by the airlines. The pilots take
our tests, and we will follow up in a year or so to see which men worked 
out well on the job. If we are right, we have a set of tests here that will 
select good pilots. But we’re not sure unless we make some predictions and 
see how they work out. This set of tests is new. It includes some of the old 
stand-bys from the experimental psychology laboratory, like this coordina- 
tion test. And many tests used during the war will be useful here. But our 
group of psychologists has been busy modifying old tests and inventing new 
ones, relying heavily on what we learned from wartime pilot selection. 
Nearly everyone in the group has a pet idea for a new type of test that 
might measure some important factor in pilot ability. 

“We decided what to include by studying what makes a good pilot. We
interviewed hundreds of pilots, CAA inspectors, and check pilots. We had
them tell what caused accidents they knew of, what mistakes they had
made, what things they knew made some men poor pilots. We also exam- 
ined previous research on abilities. From the combined evidence, we got a 
list of the characteristics that seem to be most important in piloting. Then 
we found or designed tests which seemed to measure each characteristic. 
Our present list of ‘critical requirements of the airline pilot’s job’ is (1):

“ ‘1. Judgment and Planning. The ability to reason logically, to anticipate, and
to use good judgment in practical situations.

“ ‘2. Pilot Information. The possession of adequate knowledge of aero-equipment,
principles of flight, instrument flying, navigation, and weather.

“ ‘3. Mechanical Relations. The ability to visualize mechanical movements and
to have insight into the operating principles of mechanical devices.

“ ‘4. Visualization of Movement. The ability to visualize movements in the
attitude of an object and to visualize the flight path of an airplane through
space.

“ ‘5. Instrument Reading. The ability to obtain data from instrument dials and
scales quickly and accurately.

“ ‘6. Instrument Interpretation. The ability to determine the attitude of an
airplane from interpretation of the readings of instruments.

“ ‘7. Signal Reaction Time. The ability to perceive a situation, decide quickly on
the appropriate response, and make the correct movement.

“ ‘8. Coordination. The ability to make smooth and precise movements.

“ ‘9. Orientation. The ability to recognize landmarks from maps or photographs
and to orient with reference to fixed positions in space.

“ ‘10. Personal Stability. The ability to remain organized and to make appropriate
responses to adverse or unusual situations.

“ ‘11. Attitudes and Interests. The possession of personal attitudes and interests
appropriate for an effective airline pilot.’

“The tests? Most of these things can be measured by printed tests: reasoning
questions, questions of knowledge, tests showing pictures of the
instrument panel and asking for interpretations. The mechanical relations test
shows pictures of machinery and asks the man, for instance, how a wheel 
will turn when a certain gear moves. We have a few apparatus tests, which 
measure how well the applicant coordinates. Light signals indicate that he
is to move certain controls, and the apparatus records how rapidly he re- 
sponds. Sometimes he has to make several actions at once, which is pretty 
confusing to a man with poor aptitude and self-control. In addition, we
examine the man’s background, and interview him.

“These methods probably aren’t the last word in pilot selection. We’ll have 
to change some of these tests, and we’ll have new ideas next year which will 
give us additional tests. But the research is progressing. The principal thing
we’re sure of now is that this job is too complicated to be predicted by just 
a single mental test of some sort. But with the right tests it can be predicted 
well enough that we should reduce crashes due to pilot error. Besides, the 
airlines will save a great deal of money if we can identify men likely to fail 
in training, since training a pilot is expensive.”

That’s the picture of testing in action. It doesn’t yet cover the clinical 
psychologist in a hospital, the experimental psychologist studying a theoreti- 
cal problem, the tester preparing standardized tests for school use, the vo- 
cational counselor, and many others. We haven’t paid much attention to 
the Miss Fields who give most of the tests, in offices, clinics, schools, and 
industries. But from the portraits just presented we can draw some generali- 
zations which warrant spending time learning about tests: 

1. Tests play an important part in making decisions about people. 

2. There is a great variety of tests, covering many sorts of characteristics. 

3. Even for a single characteristic, such as mental ability, there are many 
tests which have different uses. 

4. The significance of test scores is greatest when they are combined with a 
full study of the person, by means of interviewing, case history, applica- 
tion blanks, and other methods. Tests provide facts which help us under- 
stand people. They almost never are a mechanical tool which can make 
decisions for us. 
CHAPTER 2


Purposes and Types of Tests 

FUNCTIONS OF TESTING 

Almost any task or action some men can and will do more skillfully than 
others. Since these individual differences are a principal problem in directing 
the activities of others, every leader, manager, teacher, social worker, and 
physician must be able to identify them. The purpose of every psychological
test is to detect differences between individuals. This knowledge serves two
functions: prediction and diagnosis.

Prediction

An attempt to predict underlies every use of testing. Whenever a test is 
given to two people, it tells about some difference between their
performances at this moment. But the fact would be of no significance, would not
be worth knowing, if from it one could not predict that these two people 
would differ in some future activity. Consider a test as abstract as flash rec- 
ognition of letters. Some people recognize four letters in a brief instant; 
others react to five. This difference intrigues the curiosity, but it is an un- 
important fact until it is related to some behavior it predicts. The psy- 
chologist interested in applications may be concerned with it because this 
behavior has much in common with airplane recognition and perception in 
reading. The theoretical psychologist is less concerned with immediate ap- 
plication of laboratory findings, but he too will consider individual differ- 
ences in a predictive sense by inducing from them general laws about 
perception and recognition applicable to the future behavior of the indi- 
vidual. 

Prediction is illustrated by clinical use of tests. A clinician gives a test 
of emotional adjustment to a client and finds that his score is far from the 
norm. Giving the test and using the results are based upon the assumption 
that atypical answers to the test predict that he will behave abnormally at 
some time in the future, unless corrective measures are taken. Prediction 
nearly always refers to some eventual overt behavior. The clinician would 
not be set to detecting emotional maladjustment if that were only some 
internal state of the subject, never to crop out. The significance of the entire
problem hinges on the fact that answers given on the test permit one to 
predict deviant behavior which should be forestalled. 

Prediction is the ultimate justification for the achievement test used for 
grading in school. It is used for prediction in two ways, although both



ESSENTIALS OF PSYCHOLOGICAL TESTING 


teacher and pupil may overlook this fact in their concern with the grade it- 
self. The grade is warranted primarily because it predicts the pupil’s ability 
to carry on some future activity; if it does not predict that, grading is likely 
not to be worth the troubles it causes. Furthermore, test results should give 
the teacher a prediction that if she teaches another class by similar methods, 
she can anticipate similar results of instruction. Without its predictive
function, grading would be an act of crying over spilt milk.

Prediction is most obvious in selection, when one chooses among many 
candidates for an assignment. Military selection has employed tests exten- 
sively at every stage of the induction and training procedure. Out of a vast 
number of recruits, tests quickly discard those who will be unlikely to adapt
to the service, and identify the upper groups to be considered for training as
specialists or officers. A further selection takes place at the end of training,
when some men are passed for assignment to duty and others are failed and
given less responsible work. Tests bear a major responsibility in military
screening, since it is impossible to give personal attention to each individual, 
and decisions must be made rapidly and with finality. Industrial selection 
bears many of the characteristics of military processing. 

1. Demonstrate that prediction is attempted in each of the following situations. 

a. A foreman is asked to rate his workers on accuracy and quality of work.

b. Airlines require a periodic physical examination of pilots. 

c. A psychologist determines whether students are more “liberal” in their atti- 
tudes toward birth control after two years of college study. 

d. James receives a grade of C in algebra while Harry receives an A. 

Diagnosis 

Prediction emphasizes differences between individuals, or between an
individual and some standard. A second function of testing, diagnosis,
emphasizes differences among characteristics of the same individual. Diagnosis in
guidance and individual case work identifies the particular strengths of the 
individual so that he may capitalize on them, or the particular weaknesses so 
that he may adapt to or correct them. The clinical psychologist diagnoses
when he identifies the precise difficulty in a case of maladjustment. The 
school psychologist or counselor diagnoses cases of scholastic failure to de- 
termine appropriate action. Diagnosis involves prediction, since the test user 
must decide what behavior the present pattern of characteristics permits and 
how stable that pattern will remain. 

The term “diagnosis,” as used in this book, refers to the analysis and
description of various aspects of behavior. This should be distinguished from
the more restricted meaning of diagnosis as the process of placing the
subject in a particular category, such as schizophrenic or psychopathic.

The principal difference between the diagnostic and predictive emphases 
in testing is that in diagnosis we must be more certain what a test score 
means. If we know that Test X predicts a given behavior, and that case m
has a low score on Test X, we need know nothing about the reason for his
low score in order to predict poor performance. But if diagnosis is sought, it 
is important to know what characteristics cause a low score on Test X, since 
the emphasis in diagnosis is on explaining the facts about one’s case. If one 
knows what is wrong, one can plan a remedy, or soundly predict that the
problem is incurable. 
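
The contrast between prediction and diagnosis can be made concrete with a small numerical sketch. The Python below is an illustration only, not a procedure from this text: the scores, the later performance ratings, and the names pearson_r and predict_criterion are all hypothetical. It assumes a simple straight-line (least-squares) relation between Test X and the behavior to be predicted. The prediction uses nothing but the observed relation between score and later performance; it offers no explanation of why a score is low, which is what diagnosis would require.

```python
from statistics import mean


def pearson_r(scores, criterion):
    """Product-moment correlation between test scores and a later criterion."""
    mx, my = mean(scores), mean(criterion)
    sxy = sum((x - mx) * (y - my) for x, y in zip(scores, criterion))
    sxx = sum((x - mx) ** 2 for x in scores)
    syy = sum((y - my) ** 2 for y in criterion)
    return sxy / (sxx * syy) ** 0.5


def predict_criterion(scores, criterion, new_score):
    """Least-squares prediction of the criterion for a new test score."""
    mx, my = mean(scores), mean(criterion)
    slope = (sum((x - mx) * (y - my) for x, y in zip(scores, criterion))
             / sum((x - mx) ** 2 for x in scores))
    return my + slope * (new_score - mx)


# Hypothetical data: Test X scores and later performance ratings
# for six persons already followed up.
test_x = [10, 14, 18, 22, 26, 30]
performance = [2.1, 2.4, 3.2, 3.3, 4.1, 4.3]

print("correlation:", round(pearson_r(test_x, performance), 2))

# A new case with a low score: poor performance is predicted
# without any knowledge of the reason for the low score.
print("predicted rating:", round(predict_criterion(test_x, performance, 8), 2))
```

A diagnostic use of Test X would need information this sketch does not contain, namely which characteristics of the person produced the low score.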

2. Describe one circumstance when a diagnostic approach would be used by each 
of the following persons. 

a. A girls’ counselor in a college. 

b. An employment manager. 

c. A social worker dealing with children. 

d. A teacher of typewriting. 

3. Show that each illustration given in answer to Question 2 involves attempted
prediction. 


Research 

Attention should be drawn to research as a function of testing, even 
though research involves prediction and occasionally diagnosis. In the search 
for knowledge about human behavior, many questions arise which have only 
remote practical importance but which deal with fundamentals of biological 
or social science. Tests have frequently been designed to solve research 
problems, even though these tests have no everyday utility. One example 
of such a test was developed by Lewis. He played phonograph recordings of 
words backwards, in order to study the way people learn to recognize these
strange sound patterns (3). Such tasks, just because they are novel, fre- 
quently make very good experimental test situations. Another type of re- 
search makes use of common tests but seeks scientific knowledge rather than 
application; an example is the study of differences between racial groups, in 
which psychological tests have been prominent. 

WHAT IS A TEST? 

A test may be defined as a systematic procedure for comparing the
behavior of two or more persons. This is a broad definition, broader than that
used by some other writers. The layman is apt to think of a test as a series 
of questions requiring a written or oral answer. Psychological tests are, how- 
ever, extremely varied, and the range of stimuli adapted to testing is con- 
stantly growing. 

The definition given here rules out one large class of procedures from con- 
sideration as tests. Interviewing is an important method of case study, but 
rarely is every individual treated in exactly the same way, and the emphasis
is more upon study of the individual than upon comparisons. Occasionally 
standardized interviews have been used, in which a uniform set of questions 
is asked of all persons. Some of these interviews qualify as tests under the 
definition given above. 
One would not consider as a test an informal method of forming an opinion
about the individual, such as casual conversation. Even if a teacher calls on 
each member of a class in turn, asking questions about an assignment, such a 
procedure cannot be considered a test unless the questions are set and the 
performance of one pupil can be compared with normal expectation on the 
question he is asked. For a study of behavior to be a test, every person must 
be observed in the same way as those with whom he is compared. It is not
necessary that results be reduced to scores. Some useful tests give verbal 
descriptions of the person rather than numerical scores. 

The definition of testing suggested permits us to consider measurements 
using apparatus, systematic procedures for studying frustration, sets of ques- 
tions for obtaining reports on personality, and collected records of perform- 
ance in life situations. Thus a record of the production of a worker over a 
period of two weeks might be considered a test, if comparable records may 
be obtained for other workers. This broad definition is adopted because the 
principles of test construction, use, and interpretation to be studied in later
chapters apply equally to all these types of measurement. Validity,
objectivity, standardization, standard scores, and the like are broadly applicable
concepts. It appears simpler to refer to all standard procedures as tests than 
to draw careful distinctions between tests, questionnaires, work samples, and 
observations. The reader is warned, however, that many definitions for “test” 
are in current use, so that it is important to use only one meaning in any 
context.

Tests, in this broad conception, need not be considered as measurements. 
Measurement attempts to assign a number, on a scale of equal units, to each 
person for each characteristic. Not only are psychological tests less perfect 
as measurements than measures used in other sciences, but many useful tests 
do not measure. One way to study a thing is to measure, but an equally help- 
ful way is to describe it accurately. Tests of personality in particular often 
seek to describe a person rather than to reduce his behavior to a few scores. 
The measurement emphasis in psychology stems from two famous Thorn- 
dike dicta: “If a thing exists, it exists in some amount”; “If it exists in some 
amount, it can be measured.” While these postulates set an excellent goal 
for psychology as an exact science, they have caused non-quantitative tech- 
niques to be slighted. Psychology must at present use both measurement 
and description. As in botany, geography, meteorology, and chemistry,
descriptive methods may some day decline in emphasis as refined
measurement can cope with more of our problems. It is likely, however, that some
aspects of behavior will always remain beyond the reach of the measuring 
instrument. 

4. Decide whether each of the following is a test in the sense defined above. 

a. A mark for a course is based on the average of the six tests given during a 
course. 

b. An anthropologist measures the skull dimensions of subjects. 
c. An inspector checks samples of the work produced by each punch-press
operator and gives him a score on the quality of his work. 

d. A nursery school teacher rates children on sociability after observing the
number of contacts each child makes with other children on the playground.

e. Judgments on the physical skill of schoolboys are obtained by asking the
coach to report his opinion of each. 

f. Foremen who have been given a course in industrial relations are asked to 
write out how they would solve a series of hypothetical grievance cases. 

5. Compare the following definitions with that given in this text. What procedures,
if any, are called tests in this text that are excluded under each of the following
definitions?

a. “A situation in which a record of performance is secured directly, not by 
estimate” (2, p. 776). 

b. “The task, together with the means of appraising it, which defines an ability”
(5, p. 62).

c. “A psychological test is too often thought of simply as a measuring device. 
It is this and more. It is a standard situation in which to observe behavior” 
(1, p. 228).


CLASSIFICATION OF TESTS 

There are numerous ways of classifying tests: according to form, purpose,
content, and other characteristics. The simplest and most applicable grouping
distinguishes between tests seeking to measure the maximum performance
of which the subject is capable and those seeking to determine his typical
behavior.


Tests of Ability 

Tests of maximum performance are required in studying ability and capacity.
Capacity is the person’s hypothetical potentiality for acquiring a
reaction, with training. One can never know what capacity is. Even if a person
has reached a certain level after training, we cannot be sure that he
would not do better with better training or motivation. Ability is more tangible;
it may be defined as the person’s performance on a task at present,
with maximum motivation but without further training. Ability is never
greater than capacity; how closely it approximates capacity depends upon 
the extent to which the subject’s potentialities have been developed. Some- 
times, too, emotional interference makes ability less than it could be, as in 
the case of stage fright. 

Tests of maximum performance are tests of ability. Tests of ability to de- 
fine words (vocabulary tests) have been used to predict school performance. 
Tests of ability to answer questions about a job (trade tests) have been used 
to distinguish between expert and less experienced workers. Tests of ability 
to respond instantly to a signal (reaction-time tests) have been used to weed
out applicants who would make unsafe drivers. In each of these cases the 
tester seeks a measure of the best performance the person can produce. If 
this is obtained, a person who makes a poor score is known to be unsuitable 
for the purpose at hand, at least until he is given further training or treatment.
As will be seen in Chapters 4 and 5, a principal problem with tests of
this type is to make certain that the subject is doing his best. A good score 
does not guarantee, of course, that the person will be equally superior under 
everyday motivation. 

Tests of Habitual Characteristics 

Tests of typical performance are used in studying habits and personality. 
There is little value in determining how courteous an applicant for employ- 
ment in a store could be when she wanted to; almost anyone of normal up- 
bringing has the ability to be polite. But the test of a suitable employee is 
whether she maintains that courtesy in her daily work, even when she is not 
specially motivated or “on her best behavior.” An inspector with proper vision
and training should be able to detect defective parts. A test of his carefulness
which determines how well he judges parts when trying especially
hard would have little additional value. The difference between the good
and poor inspector (of equal aptitude and training) is the extent to which
the latter permits himself to be distracted and careless in run-of-the-mill 
duty. For cheerfulness, honesty, open-mindedness, emotional stability, and 
many other aspects of behavior there is almost no practical value in a test 
of ability, since most people can produce a show of the behavior when it is 
demanded of them. But persons who can act cheerfully, honestly, or impartially
when they know they are being tested may not do so in other situations.
The difference between the test of typical behavior and the test of
maximum ability is that in the latter case the subject knows what is judged 
good and seeks to show good performance. 

In many situations in which typical performance is studied the concept 
of “good” or “best” performance is inappropriate. Personality is usually analyzed
in a descriptive way, with no attempt being made to consider one
characteristic as ideal. Interest tests show this clearly. There is nothing good 
or bad about interest in, say, engineering. One who has this interest can use 
it, but one who has not finds other worth-while activities. In social relations, 
people show wide variations in dominance-submission, but one cannot say 
that any proportion of these qualities is best, since our world has places for 
all degrees of this behavior. 

Testing of typical performance is difficult. It has been accomplished, with 
greater or less success, in a variety of ways. These methods may be divided
into behavior observations and self-report devices. 

Behavior observations are attempts to study the subject in action. Prefer- 
ably he is unaware that he is observed; it is hoped that under such condi- 
tions a helpful sample of his typical behavior can be described. Behavior 
observations include both observations in standardized test situations and 
observations under “natural” conditions. 

Test situations are usually employed for observing an aspect of behavior 
which could be seen only occasionally in everyday life. This technique has
been used to determine typical reactions to frustrations. The person is
started on a task and prevented in some way from succeeding at it. He may 
then react in many significant ways which give insight into his personality 
and emotional control. It is essential for the validity of test observations that 
the subject not know what characteristic is being observed. Sometimes the 
observer is concealed. At other times the subject is led to believe he is being 
tested on one behavior, when another is being observed. Reaction to frustra- 
tion is frequently studied when the subject believes he is being tested for 
mental ability. His responses when frustrated by difficult questions are usually
genuine and little disguised (4).

Field observations, where the subject acts in his normal situation, may 
call for elaborately standardized observations, or merely for ratings or judg- 
ments by persons who have had a good opportunity to observe the subject. 
Field observations are used in industry to study the production habits of 
workers, in preschools and on playgrounds to study social reactions of chil- 
dren, and in a variety of other situations. In baseball, for example, the bat- 
ting average is essentially a score based on field observations, recorded sys- 
tematically. 

The self-report approach is based on the fact that the subject has had an 
unusually good opportunity to observe himself in a variety of situations, so 
that he can, if he wishes, give a helpful estimate of his behavior. Self-report 
tests consist of standardized questions for obtaining such estimates. The cru- 
cial problem in using self-reports is obviously that of obtaining honest anal- 
ysis. If the person tries to give a “best” picture of himself instead of a typical 
one, the test will fail of its purpose. An added problem is that his view of 
himself is likely to be distorted to some degree, since one cannot be a truly 
detached and impartial observer of himself. 

6. Classify each of the following procedures as a test of ability, a self-report test,
an observation in a test situation, or a field observation.

a. An interviewer from the Gallup poll asks a citizen how he will vote in a 
coming election. 

b. A broadcasting company wishes to know what program features appeal to
different types of listeners. It presents a broadcast to a studio audience, who 
react to the broadcast by pushing signal buttons to indicate when they enjoy 
or dislike a part of the program. 

c. A test of “vocational aptitude” asks the subject how well he likes such activities
as selling, poker, woodworking, and chess.

d. An arithmetic test is given to a group of applicants for jobs as cashiers. 

e. Inspectors in plain clothes ride streetcars to determine whether operators 
are pocketing fares without “ringing them up.”

f. During an intelligence test, the examiner is alert for evidence of self-confi- 
dence or lack of self-confidence. 

g. In a test of “application of principles of science,” students are told about a 
fire which breaks out when an oil stove is turned over, and asked to decide 
how to put out the fire and to tell why it should be done that way. 

h. In a test of “application of principles in social studies,” students are told of
a conflict about admitting Negroes to a housing project, asked to state what
they think the city council should do and to give reasons to support the 
choice. 


Terms Used to Describe Tests 

In addition to the classification just discussed, there are many bases for 
grouping tests to call attention to particular characteristics. Certain basic 
terms are used to describe tests, although not every writer uses these terms 
in the same way. 

Objective tests are often distinguished from subjective tests. Objective 
procedures are those in which responses are scored or summarized in such 
a way that all judges would agree as to the score assigned or the analysis 
made for each person. True-false tests, arithmetic tests, and measures of
speed of tapping are representative objective procedures. In subjective tests 
the opinions of the scorers are permitted to affect the score assigned. Typi- 
cal essay examinations, ratings of quality of art or sewing, and estimates of 
personality characteristics from observation are subjective, at least to some 
degree. 

Standardized tests are those in which the test procedure and content have 
been so fixed that subjects at different places and times may be compared. 
A standardized test usually but not always has norms, or standards of nor- 
mal performance, for use in interpreting results. Standardized tests may be 
contrasted to informal tests, such as those made up by a teacher to test a
particular block of assignments. 

Performance tests are distinguished from pencil-and-paper tests. In a per- 
formance test the subject is required to demonstrate his skill by manipulating 
objects or apparatus. Among the performance tests which have been used 
for various purposes are assembling a jigsaw puzzle, constructing a bicycle
bell from its parts, “inventing” a hat rack out of two sticks and a C clamp, 
and “sewing” a zigzag line on a card with a threadless sewing machine. 
Pencil-and-paper tests are those in which the subject answers questions by 
writing or drawing. The questions may be verbal or pictorial. A third variety, 
oral tests, should also be mentioned. 

Group tests differ from individual tests in that group tests permit many 
subjects to be tested at once. This distinction refers more to the way in 
which the tests are commonly used than to differences in their structure.
Group tests can be given to a single individual if that is desirable. Many 
individual tests require careful oral questioning or observation of reactions, 
in which case a tester cannot work with several subjects simultaneously. But 
some individual tests have been modified and simplified so that group administration
is also possible. An example is the Rorschach test of personality.
In the individual form of the test, a subject looks at a card bearing an ink- 
blot, and states orally what the blot looks like to him. He is questioned about 
each response until the tester is sure just what the subject sees. In the group
form, the blots are placed on slides and projected on a screen. Subjects write
their responses, and individual questioning is omitted.

Speed tests are contrasted to power tests of ability. In speed tests (also 
called speeded or time-limit tests), the subject is required to complete as 
many problems or tasks as he can in a predetermined time. Some speed tests
measure nearly pure speed, since the task is one the subject could surely 
solve if given enough time. Examples are the test, “Go over this page, crossing
out as many a’s as you can before I call time,” or the test, “In this row
of circles, place just one dot in each circle; make as many dots as you can
before I call time.” Power tests, also known as unspeeded or work-limit tests,
present a set of problems of increasing difficulty, the subject being required 
to do as many problems as he can. A test may measure power and speed in 
any mixture, depending on the difficulty of the items. To determine if speed
is important, the tester must obtain data on the question: “Did the person
have time to finish all or nearly all of the items he could have answered with
unlimited time?” 

Tests are also subdivided according to purpose. Psychological tests of
maximum performance are referred to as mental tests, intelligence tests, 
aptitude tests, prognostic tests, tests of special abilities, psychomotor tests, 
achievement tests, etc. Formal definitions of these terms are not given here, 
but each type is treated at length in later chapters. Mental tests are tests of
ability to perform mental tasks. The term “intelligence test,” while not rigidly 
used, usually refers to procedures attempting to measure the general, over- 
all mental ability of the individual. Aptitude and prognostic tests predict 
performance or learning in a new situation; one might have an aptitude test
for dentistry, or for carpentry, or for music. Special-ability tests attempt to 
measure some limited performance such as numerical reasoning or color 
vision. Psychomotor tests measure performance of adaptive physical activi- 
ties; tests of coordination in skilled acts, speed of complex reactions, and 
steadiness are frequently used. Achievement tests are used to determine how 
much a person has learned from some experience such as schooling. 

Because some writers prefer to restrict the term “test” to measures of
ability, numerous other words are used to refer to devices for estimating
typical performance. Self-report devices in particular are referred to by
many names: “inventory,” “questionnaire,” “opinionaire,” “checklist,” “record,”
etc. These procedures have usually been thought of as “personality
tests,” and the various measures of personality have been classified as adjustment
tests, character tests, attitude tests, interest tests, etc. Among personality
tests, special interest attaches to the adjustment test. Adjustment
tests attempt to measure the degree to which the subject has emotional conflicts,
tensions, and unsatisfied needs. They are used in investigations of the
mental hygiene of individuals or groups. Character tests are used to measure 
traits upon which society places a moral valuation, such as stealing, lying, 
and cheating. Attitude tests study the person’s opinion regarding political
and social policies and institutions; there are tests of attitude toward war, 
progressive education, Russia, one’s job, and one’s present teacher. Interest 
tests are similar to attitude tests, but stress liking for various sorts of activities.

Although some scheme of classifying tests is a convenience, all such divi- 
sions are arbitrary. One of the striking current trends is the breakdown of 
traditional division lines. As will be seen in later chapters, intelligence tests
are being used to diagnose personality, and some of the newest tests of per- 
sonality give useful estimates of intelligence. Even the distinction around 
which this book is organized, between tests of maximum and typical performance,
is difficult to sustain for a few borderline tests.



Fig. 1. Two of the Porteus mazes (Year V and Adult I).

7. Judge each of these statements true or false and defend your answer.

a. Batting averages are objectively determined. 

b. The 220-yard low hurdle race is a standardized test. 

c. A typing test which requires the typist to copy from a printed page as rapidly
as possible is a performance test.

d. A spelling test, dictated by the teacher and written by the pupil, is a
pencil-and-paper test.

e. A teacher has each member of the class read the same article in a current 
magazine. Time is called at the end of three minutes, and each pupil marks 
the place where he is reading. He then counts the number of words read 
and computes his reading rate in words-per-minute. This score is compared 
with a table of average reading speeds for typical magazine articles. This 
test is objective.

8. Classify each of the following tests, using as many of the descriptive terms discussed
in the text as are clearly applicable.

a. The Humm-Wadsworth Temperament Scale consists of several hundred 
printed questions, such as “Do you like to bet?” The subject answers each
question by checking “yes” or “no.” Answers are scored to indicate the ex- 
tent to which he shows various trends in temperament. 

b. The Porteus Maze Test requires the child to trace the correct path through
each of a series of printed mazes. The first mazes are very easy, with successive
ones becoming more difficult. The child continues until he fails several
trials at a particular maze. A failure is judged when he enters a blind alley.

c. In the Stenquist test booklet the subject marks illustrations of tools
and mechanical objects to show which go together.

d. The Bennett mechanical comprehension test presents questions about 
pictorial problems. A typical item shows two wagons poised on a stair- 
way; the subject is to decide which wagon will roll down because of 
poor balance. Testing time is unlimited. 

e. In the Tweezer Dexterity Test the subject places pins in holes on a special 
board as rapidly as possible. 

9. Would speed or power tests be most important to measure the skill of the following?

a. typists. 

b. radio repairmen. 

c. dentists. 

10. Can a subjective test be a standardized test? 

11. Can an achievement test be an aptitude test?

12. Can a performance test be a group test?

Fig. 2. The Tweezer Dexterity Test. (Photograph
courtesy C. H. Stoelting Co.)



CHAPTER 3 


Interpreting Test Scores 

RAW SCORES AND DISTRIBUTIONS 
Meaning of a Raw Score 

Most tests yield a numerical report of a person’s performance. This may be
reported in terms of the number of questions he answered, the time he required
for the task, or some similar number. This number is his raw score.
Because raw scores are readily available, and familiar from long experience 
in classroom tests, many people interpret them without realizing their limitations.
A familiar example from the old-fashioned report card will demonstrate
the problem.

Willie brings home a report showing that his average in arithmetic was 
75, and his average in spelling was 90. His parents can be counted on to 
praise the latter and disapprove the former. Willie might quite properly 
protest, “But you shoulda seen what the other kids got in arithmetic. Lots 
of them got 60 and 65.” The parents, who know a good grade when they see 
one, refuse to be convinced by such irrelevance. What do Willie’s grades 
mean? It might appear that he has mastered three-fourths of the course work 
in arithmetic, and nine-tenths in spelling. But Willie objects to that, too, “I 
learned all my combinations, but he doesn’t ask much about those. The tests
are full of those word problems, and we only studied them a little.” Willie 
evidently gets 75 percent of the questions asked, but if the questions are 
hard, the percentage itself is meaningless. We cannot compare Willie with 
his sister Sue, whose teacher in another grade gives much easier tests so that 
Sue brings home a proud 88 in arithmetic. It could be, too, that Willie’s 
shining 90 in spelling is misleading. If the spelling tests deal with the very 
words assigned for study each day, and everyone but Willie earns 95 to 100, 
90 is not so ideal. 

Raw scores can indicate nothing of generalized significance until they are
compared with some standard. Scores may be said to have an absolute mean- 
ing and a relative meaning. If a high jumper can cross the bar at 6 feet 3 
inches, we have an absolute measure of what he can do. His performance 
takes on more meaning relative to the norm for his age, the performance of 
others in his group, or the world’s record for high jumping. In mental variables,
absolute meanings are uncommon. Willie’s ability to spell 90 percent
of his spelling words is an absolute and meaningful report— of his ability to 
spell those words. But it tells us nothing about his ability to spell words in
general unless we know how common or difficult the test words were. If we
standardize the list of words so that they are representative of the words he 
needs to spell in daily life, the report that he can spell 90 percent of them 
becomes an absolute measure of his ability to spell in general. 

For an absolute measure one must have as a standard of reference some 
true zero of performance. Suppose Willie had earned a score of 10 percent 
in spelling. Does this mean that he knows only one-tenth of the words he
should? No, for the teacher probably did not ask about the easy words that
Willie was sure to know. Even a score of zero on the test would not mean
zero on knowledge of spelling-in-general. The difference between Willie
with a score of zero and the model pupil who earns 100 is perhaps a difference
in ability to spell only 20 words out of an active vocabulary of several
thousand, if those 20 constituted the test. This interpretation is much more
favorable to Willie. The same difficulty is less readily recognized in tests of
intelligence. A raw score of 80 may seem to represent ability twice as great 
as the raw score of 40. Since the test does not include the items everyone can 
answer, perhaps the true ratio of performance, if every possible item had 
been used, would be 140 to 180, or 1040 to 1080. Even an infant, with his 
ability to recognize food and his mother, shows the presence of more than 
zero intelligence. Absolute zero, in any test of ability, is “just no ability at 
all.” 
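
The arithmetic behind the intelligence-test example can be sketched briefly. The snippet is only an illustration: the function name `ratio_with_unasked_items` is hypothetical, and the counts of 100 and 1000 easy items left out of the test are simply the figures implied by the ratios 140 to 180 and 1040 to 1080 quoted above.

```python
def ratio_with_unasked_items(high, low, unasked):
    """Ratio of two performances after crediting both persons with
    `unasked` easy items that the test never included."""
    return (high + unasked) / (low + unasked)


# Raw scores of 80 and 40, with 0, 100, and 1000 easy items left off the test.
for unasked in (0, 100, 1000):
    ratio = ratio_with_unasked_items(80, 40, unasked)
    print(f"{unasked:>4} unasked items: ratio = {ratio:.2f}")
```

The apparent two-to-one advantage shrinks toward equality as more unasked items are credited to both persons, which is why a ratio of raw scores cannot be interpreted unless the true zero is known.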

1. Decide whether each of the following statements has absolute meaning, that is,
whether one can make a meaningful interpretation of it without further data.

a. Sarah can type 115 words per minute on typical business copy. 

b. Fred reads at the rate of 300 words per minute in the average college text- 
book. 

c. Morris earned a score of 84 on this mental test. 

2. How could one design a test which would give an absolute measure of the num- 
ber of words a person knows, without asking about every word in the language? 

3. Decide whether an absolute zero exists for each of the following variables, and,
where possible, define it.

a. Height. 

b. Ability to discriminate between the pitches of tones. 

c. Speed of tapping. 

d. Gregariousness, seeking the companionship of others. 

e. Rifle aiming.

Differences in raw scores do not represent absolute distances between in- 
dividuals. Suppose Harry gets 110 points, Sam gets 120, and Jeff gets 130. 
Is Jeff as different from Sam as Sam is from Harry? Not at all, since the 
meaning of this difference depends upon the items in the test and the group
studied. Any difference between raw scores can be stretched or reduced by
introducing more or fewer items that discriminate between the persons involved.
Suppose we add to the test ten items that Jeff can pass, but Sam and
Harry cannot; the difference between Jeff and Sam has been stretched to 20
points. While absolute differences can be changed in this way, the ranks of
the three men remain the same. 
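
The example of Harry, Sam, and Jeff can be worked through in a few lines. This is a minimal sketch: the helper `ranks` is a hypothetical name introduced for illustration, and it assumes one point per item, so that ten added items which only Jeff passes raise his score by ten.

```python
def ranks(scores):
    """Rank 1 goes to the highest score; tied scores share the same rank."""
    ordered = sorted(scores.values(), reverse=True)
    return {name: ordered.index(score) + 1 for name, score in scores.items()}


original = {"Harry": 110, "Sam": 120, "Jeff": 130}

# Add ten items that only Jeff can pass (the case described in the text).
extended = dict(original)
extended["Jeff"] += 10

for label, scores in (("original", original), ("extended", extended)):
    gap_sam_harry = scores["Sam"] - scores["Harry"]
    gap_jeff_sam = scores["Jeff"] - scores["Sam"]
    print(label, gap_sam_harry, gap_jeff_sam, ranks(scores))
```

The gap between Jeff and Sam doubles while the gap between Sam and Harry, and the rank order of all three, stay as they were.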

For this reason, raw scores are frequently converted into ranks, a kind of 
relative score, before use. Willie’s report card (90 in spelling, 75 in English,
75 in arithmetic, and 86 in drawing) presents a very different picture if it
reports his rank among 30 children (8 in spelling, 12 in English, 23 in arithmetic,
and 23 in drawing).

4. If several pupils in Willie’s class move away and are replaced with newcomers, 
will his raw scores probably change? His ranks? 

5. If a different set of test questions had been used in English, would Willie’s raw 
score change? His rank? 

6. Alfred, a college freshman, is to receive guidance on his academic plans, and is 
given four tests of ability. Scores are presented in four different ways. Consider 
separately each row of scores, and interpret each set. 



                           Vocabulary   Verbal      Non-Verbal   Performance
                                        Reasoning   Reasoning

Raw score                     116          32          44            20
Percent of possible            77          73          80            67
Points above average           24          10          20             0
Rank among 260 freshmen       104         113          41           136

An Illustrative Mental Test 

Before considering procedures for more systematic treatment of scores, let 
us examine a simple mental test. The Kent EGY (“emergency”) Test is a 
short intelligence test, used when a quick estimate of ability is desired. The 
following questions are presented orally to each individual (5):

1. What are houses made of? (Any materials you can think of) 

One point for each item, up to four. 

2. What is sand used for? 

Four points for manufacture of glass. Two points for mixing with concrete, 
road building, or other constructive use. One point for play or scrubbing. 
Credits not cumulative. 

3. If the flag floats to the south, from what direction is the wind? 

Three points for north, no partial credits. It is permissible to say: ‘Which 
way is the wind coming from?’ 

4. Tell me the names of some fishes. 

One point each, up to four. If the subject stops with one, encourage him 
to go on. 

5. At what time of day is your shadow shortest? 

Noon, three points. If correct response is suspected of being a guess, inquire
why.

6. Give the names of some large cities. 

One point for each, up to four. When any state is named as a city, no 
credit for New York unless specified as New York City. No credit for home 
town except when it is an outstanding city.

7. Why does the moon look larger than the stars? 

Make it clear that the question refers to any particular star, and give assurance
that the moon is actually smaller than any star. Encourage the
subject to guess. Two points for “Moon is lower down.” Three points for
“nearer” or “closer.” Four points for generalized statement that nearer ob- 
ject looks larger than more distant object. 

8. What metal is attracted by a magnet? 

Four points for iron, two for steel. 

9. If your shadow points to the northeast, where is the sun?

Four points for southwest, no partial credits. 

10. How many stripes in the flag?

Thirteen, two points. A subject who responds 48 may be permitted to correct
his mistake. Explain, if necessary, that the white stripes are included
as well as the red ones. 

In this test, instead of one point for each question right, credits range from 
one to four points, depending on the quality of each answer. The total possible
credit is 36 points. A person with a normal fund of information who can
do simple thinking should do well on the test, but a mentally deficient person 
would earn a low score. 
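
The total of 36 points can be checked against the scoring notes for the ten questions. The sketch below is illustrative only: the names `MAX_CREDIT` and `total_score` are hypothetical, the sample record is invented, and the check verifies only that each credit lies within the item's range, not that it is one of the particular values the scoring notes allow.

```python
# Maximum credit for each of the ten questions, as given in the scoring notes above.
MAX_CREDIT = {1: 4, 2: 4, 3: 3, 4: 4, 5: 3, 6: 4, 7: 4, 8: 4, 9: 4, 10: 2}


def total_score(credits):
    """Sum the credits earned, checking each against the item's maximum."""
    for item, earned in credits.items():
        if not 0 <= earned <= MAX_CREDIT[item]:
            raise ValueError(f"item {item}: credit {earned} out of range")
    return sum(credits.values())


print(sum(MAX_CREDIT.values()))  # total possible credit

# Hypothetical record for one subject (invented for illustration).
example = {1: 3, 2: 2, 3: 3, 4: 4, 5: 0, 6: 4, 7: 3, 8: 4, 9: 0, 10: 2}
print(total_score(example))
```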

In the following sections we shall use a set of hypothetical scores on this 
test to illustrate the possible statistical procedures in test interpretation. Sup- 
pose that Mr. Gates, a psychologist attached to a criminal court, gives this 
test to prisoners to obtain a quick preliminary estimate of their ability. Men
earn different scores, and soon Mr. Gates has a set of 45 scores, ranging from
15 points to 32 points. Mr. Gates could interpret each score by using a table
of “norms” showing how men usually perform on the test. But if he has no
such table, or wishes to compare each prisoner with the entire prisoner 
group, he will use some of the statistical procedures described below. 

Distribution of Scores 

In order to compare one man with a group, we need a simple method of 
summarizing the performance of the group. The usual first step is to make 
a frequency distribution by tallying how many people earn each score. In 
Computing Guide 1, the raw scores for Mr. Gates’ prisoners are listed, and 
the procedure of tallying is illustrated. When a small group of scores is 
tallied, the distribution is irregular, but when large groups are studied, 
smoother frequency curves are obtained. It is convenient to smooth the 
distribution of scores, since minor irregularities are usually of no importance. 
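
The tallying step can be sketched as follows. The 45 scores below are hypothetical values between 15 and 32, invented for illustration; they are not the scores listed in Computing Guide 1, and the layout of the printed tally is likewise only an assumption about how such a table might look.

```python
from collections import Counter

# Hypothetical raw scores: 45 values from 15 to 32, invented for illustration.
# They are NOT the scores listed in Computing Guide 1.
scores = [
    15, 17, 18, 19, 19, 20, 20, 21, 21, 21,
    22, 22, 22, 23, 23, 23, 23, 24, 24, 24,
    24, 24, 25, 25, 25, 25, 25, 26, 26, 26,
    26, 27, 27, 27, 27, 28, 28, 28, 29, 29,
    30, 30, 31, 31, 32,
]

# Count how many men earned each score.
tally = Counter(scores)

# One row per possible score, highest first; each '/' stands for one man.
for score in range(max(scores), min(scores) - 1, -1):
    print(f"{score:>2} {'/' * tally[score]:<6} {tally[score]}")
```

Scores that nobody earned still receive a row with a frequency of zero, so gaps and irregularities in a small group remain visible.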

Figure 3 shows several frequency distributions in graphic form. These
frequency distributions, like most distributions of test scores, are practically 
continuous. In everyday thinking, we speak of “bright,” “average,” and 
“dull” groups, but when ability is measured, these groups blend together 
and can only be distinguished by some arbitrary division. Suppose Mr. Gates 
calls men scoring 28 or over “superior” and speaks of men below 28 as an 
“inferior” group. Men on both sides of the border line are much nearer each 
other in ability than men having scores of 28 and 32, even though both the 
latter are classified as superior. Any attempt to classify normal subjects into 
types is only a crude way of reporting the actual individual differences. Distinctions
of this sort are often made for practical purposes. Typists who can
perform at a rate of 60 words per minute are accepted for employment while 
those at 58 are rejected. Soldiers who can meet fourth-grade standards are 
accepted as normal while those below this level are sent to classes for illit- 
erates. An especially common situation in which the true continuity of life 
is replaced by dividing lines is the five-point grading system used through- 
out our schools. Classifications of this sort are artificial, no matter how help- 
ful they may be. Other categories invented for convenient discussion are 
represented by the words “normal,” “feeble-minded,” “maladjusted,” “neurotic,”
and “introvert.” In every trait that has been studied, scores of people
chosen at random are found to scatter gradually from one extreme to the
other.

Fig. 3. Types of frequency distribution: (c) normal distribution, (d) skewed distribution, (e) bimodal distribution.

The normal distribution is a basic concept in all statistical analysis, and 
helpful in interpreting most tests. A normal distribution is the bell-shaped 
curve illustrated in Figure 3(c). It is characterized by the concentration of 
scores near the average, and the symmetrical tapering toward each extreme.
Normal distributions are generally found in measurements of biological 
characteristics, such as height and strength, of people chosen at random. 
Many mental tests also yield normal distributions. 

A distribution which is not normal may be obtained when studying a selected
group, even with a test which gave a normal distribution for an unselected
group. Suppose curve (c), Figure 3, represents the distribution of
mental test scores among unselected children. If the same test were given in
a private school where only superior children are admitted, the curve of distribution
might have the shape of curve (d). The average would of course
be higher.

Changing the test items would also affect the shape of the distributions.
Suppose most of the easy items were dropped from the test and many items
were added which only the upper fraction of the group could pass. This
would tend to wipe out the differences between poor and very poor children
and would stretch out the curve at the upper end. The resulting curve would
again be shaped like curve (d). This is known as a skewed distribution.

Bimodal distributions such as are shown in curve (e) are rarely encoun- 
tered except where one is testing a mixed group, some of whom have been 
trained and some of whom have not. If typical adults were tested in speed 
of typing, a bimodal curve might be expected. 

7. One could slice a normal distribution symmetrically into five parts, which
would contain 7 percent, 24 percent, 38 percent, 24 percent, and 7 percent of
the cases respectively. Grades are often so assigned: 7 percent A’s, 24 percent
B’s, and so on. Sketch dividing lines cutting off roughly these percentages on a
normal curve and on a skewed curve. Is an A as far above C as an F is
below C?

8. If intelligence is distributed on a continuous curve rather than by types which 
can be distinguished sharply, what implication does this fact have for identify- 
ing and planning for the feeble-minded? 

9. Sketch a smoothed frequency curve for the distribution in Computing Guide 1. 
Is this distribution normal, skewed, or bimodal? 

METHODS OF DESCRIBING RELATIVE PERFORMANCE 
The performance of any man, in relation to that of the group, is indicated 
clearly if we locate his score against the frequency distribution. To make a 
permanent record of the man’s standing is difficult, however, since the
entire distribution can scarcely be placed in his file. Besides, we need a
convenient method of recording his standing on every test he takes, so that his
different abilities may be compared with each other. For this reason, most
users of tests must either compute or make use of such statistics as the mean,
median, and percentile ranks. In the following sections these concepts are
presented briefly. Statistics texts give more complete discussions of the
concepts and show other methods of computation.

Median and Percentile Score 

Mr. Gates could record the rank of each man, and this would conveniently 
express his standing in the group. But if, on other tests, Mr. Gates has records 
of larger and smaller groups, the rank of 22 might sometimes be average,
sometimes quite superior, sometimes quite poor. To eliminate this difficulty, 
interpretations are often based on percentile rank, that is, the person’s rank 
expressed as a percentage of the group. Ranks are figured from the top of the 
group, rank 1 indicating the best performance, but percentiles are counted 
upward, the best performance being near the one-hundredth percentile. If a 
person falls at the 75th percentile, this means that 75 percent of the group is 
below him.¹


¹ In exact computation of percentiles the case under consideration is divided in half in
the following manner: Suppose that there are 50 cases; 31 fall below case A, and 18 are
above case A. We consider that half of case A falls in each group, and the percentile
rank is 31½ divided by 50, or 63.
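
The half-case convention described in the footnote can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `percentile_rank` and its argument names are invented here and are not notation from the text.

```python
def percentile_rank(n_below: int, n_tied: int, n_total: int) -> float:
    """Percentile rank of a score, counting half of the tied cases as below it.

    n_below: cases scoring lower; n_tied: cases at the score, including the
    case under consideration; n_total: size of the whole group.
    """
    if n_total <= 0 or n_below + n_tied > n_total:
        raise ValueError("counts are inconsistent with the group size")
    return 100.0 * (n_below + 0.5 * n_tied) / n_total


# Footnote example: 50 cases, 31 below case A, 18 above, so case A stands alone.
rank = percentile_rank(n_below=31, n_tied=1, n_total=50)
print(rank)  # 63.0
```

Passing the number of tied cases separately lets the same rule serve when several persons share a score, which is the usual situation in a frequency distribution.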

Computing Guide 1. Determining the Percentile Equivalents of Test Scores.
(Example from the guide: the percentile equivalent of a raw score of 26 is 42.)

The 50th percentile is the median. The median is the middle score in the
distribution, the score that is exceeded by half the group. The median is
determined approximately by counting upward from the bottom to find the
middle of the group. In Mr. Gates’ group (see Computing Guide 1) there
are 45 persons. Counting upward, we find that the middle score,
twenty-third from the bottom, is 26. The median may be obtained more exactly
from the smoothed distribution, along with other percentiles. The median is
a convenient way of reporting the performance of the “typical” member of
the group.

Percentiles for a test are obtained by direct computation or by a graphic 
method. The graphic method is outlined in Computing Guide 1. It has the 
advantage of smoothing out minor irregularities in the distribution. For large 
groups, the graphic method and computation give nearly identical results.

Percentile scores on different tests must not be compared unless the groups 
on which they are based are comparable. Published norms for tests are based
on differently selected groups. A person at the 73rd percentile on the Purdue 
Pegboard, where the norms are based on industrial trainees, and at the 73rd 
percentile on the Bennett Test of Mechanical Comprehension BB, where 
norms are based on college freshmen, does not have equally high standing 
in these two abilities. For the Bennett test, separate percentiles for industrial
trainees are also available. A score which is at the 73rd percentile among 
freshmen is at the 95th percentile among trainees. Wherever percentiles are 
used, the norm group involved must be kept in mind. 

A second problem is that percentile scores may not be added and averaged
without introducing error. This is shown by the example of two men, each
of whom takes the EGY test twice. On the first test, Max earns 32 and Joe
earns 29. On the second test, Joe again earns 29, but Max drops to 26. The
raw scores of the two men have the same average. If the raw scores are
converted to percentiles by the table derived in Computing Guide 1, Max’s two
percentile scores are 98 and 42, and his average is 70. Joe’s percentile score
each time is 74, and that is his average. Averaging percentiles thus makes
Joe appear superior, although the raw-score averages of the two men are
identical.
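
The Max and Joe example can be checked with a short Python sketch. Only the three percentile equivalents quoted in the text are used; the dictionary and function names are assumptions introduced for illustration.

```python
from statistics import mean

# Percentile equivalents quoted in the text for three EGY raw scores.
PERCENTILE_OF = {32: 98, 29: 74, 26: 42}

# Raw scores earned by each man on the two administrations.
RAW_SCORES = {"Max": [32, 26], "Joe": [29, 29]}


def raw_average(name: str) -> float:
    """Average of a man's two raw scores."""
    return mean(RAW_SCORES[name])


def percentile_average(name: str) -> float:
    """Average of the percentile equivalents of the same two scores."""
    return mean([PERCENTILE_OF[score] for score in RAW_SCORES[name]])


for name in RAW_SCORES:
    print(name, raw_average(name), percentile_average(name))
```

The raw averages agree while the percentile averages do not, because equal steps in raw score do not correspond to equal steps in percentile.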

10. Interpret the following record of ability test scores for one person, where all 
scores are percentiles based on a random sample of adults: Verbal, 54; Num- 
ber, 46; Spatial, 87; Reasoning, 40. 

11. Determine approximately Alfred’s percentile rank in each of the four tests he 
took (Problem 6, p. 22). 

12. What is the median for Mr. Gates’ group, estimated from the percentile table? 

13. Raw scores usually change when a test is repeated, owing to chance errors of 
measurement. If each of the following persons changes two points up or down 
in raw score, on the EGY test, how much would his percentile scores change?

a. A person with a percentile score of 54 on the first test.

b. A person at the 98th percentile on the first test.

14. From the finding in Problem 13, what caution is suggested in interpreting
percentile scores? When are differences in percentile scores least reliable?

15. The scores below are the times, in seconds, required by a group of persons to 
solve a puzzle. Prepare a table of percentile equivalents for this group. 
52 34 41 42 46 45 27 48 35 35 38 29 54 36 33 30

48 39 44 36 36 34 51 40 30 33 37 41 56 32 48 35 

37 28 28 45 31 39 31 27 35 36 34 42 38 33 33 31 

39 28 36 33 37 36 34 54 34 32 33 38 

16. According to the table prepared in Problem 15, how much difference in
absolute ability does a difference of 10 percentile points represent?

Mean and Standard Deviation 

The second common system for describing distributions uses the mean
and standard deviation. The mean (M) is the arithmetical average, obtained
by adding all scores and dividing by the number of scores. The standard
deviation (s, or s.d.) is a measure of the spread of scores. The variation of
scores may be different on two tests even though the averages are the same.
Figure 4 shows the smoothed distribution of scores of two classes taking the
same test. Even though the groups are similar in mean ability, the
distributions are not at all alike. Group B contains far more very superior and
inferior cases. Group B has a larger standard deviation than Group A. A
generally useful method of computing the mean and standard deviation is
outlined in Computing Guide 2.

Fig. 4. Distributions of scores of two classes on the same test.



Fig. 5. Percentage of cases falling in each portion of a 
normal distribution. 


If Mr. Gates followed this procedure, he could place in the record, along
with each individual score, the information that the mean is 26.2 and the 
s.d. is 3.8. This, however, is fairly complicated to interpret, and the standard- 
score procedure presented below is more helpful. A particularly important 
use of the mean and standard deviation is for comparing one group with 
another. If Mr. Gates wished to compare EGY scores of habitual criminals 
with non-repeaters, the mean and standard deviation of each group would
give a convenient basis for comparison.

Computing Guide 2. Determining the Mean and Standard Deviation.

[...] the $f$ and $d$ columns, and enter in the $fd$ column.

Multiply the entries in the $d$ and $fd$ columns, and enter in a column
headed $fd^2$. Add the $fd$ and $fd^2$ columns.

4. Substitute the numbers in the following formulas:

   $c \text{ (correction)} = \frac{\sum fd}{N}$          $c = \frac{9}{45} = .2$

   $M = \text{A.O.} + c$          $M = 26.0 + .2 = 26.2$

   $\text{s.d.} = \sqrt{\frac{\sum fd^2}{N} - c^2}$          $\text{s.d.} = \sqrt{14.42 - .04} = \sqrt{14.38}$

   $\text{s.d.} = 3.8$

   A.O. is the midpoint of the score-interval selected as arbitrary origin.
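
The arbitrary-origin procedure of Computing Guide 2 can be sketched in Python as follows. The frequency table is hypothetical (Mr. Gates' tallies are not reproduced here), the function name is an assumption, and a score interval of one unit is assumed so that no interval multiplier is needed.

```python
from math import sqrt
from statistics import fmean, pstdev


def mean_sd_from_origin(freq, origin):
    """Mean and s.d. of a frequency table via deviations from an arbitrary origin.

    freq maps each score to its frequency f; d is the score minus the origin.
    """
    n = sum(freq.values())
    sum_fd = sum(f * (score - origin) for score, f in freq.items())
    sum_fd2 = sum(f * (score - origin) ** 2 for score, f in freq.items())
    c = sum_fd / n  # the correction: mean deviation from the origin
    return origin + c, sqrt(sum_fd2 / n - c ** 2)


# Hypothetical frequency table (not Mr. Gates' data).
freq = {22: 2, 24: 5, 26: 8, 28: 4, 30: 1}
m, s = mean_sd_from_origin(freq, origin=26)

# Direct computation from the expanded list of scores, for comparison.
scores = [score for score, f in freq.items() for _ in range(f)]
print(round(m, 2), round(fmean(scores), 2))
print(round(s, 3), round(pstdev(scores), 3))

# The guide's own arithmetic: the square root of 14.42 minus .04.
print(round(sqrt(14.42 - 0.2 ** 2), 1))  # 3.8
```

Any origin gives the same mean and standard deviation; choosing one near the center of the distribution merely keeps the hand arithmetic small.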

The standard deviation has a definite relation to the normal distribution. 
If scores are normally distributed, the standard deviation is the distance 
from the mean to the “point of inflection” on the shoulder of the curve. This 
is the point where the curvature changes from convex (the hill-like portion) 
to the concave tail. This relation is illustrated in Figure 5. A second valuable 
property of the standard deviation is that very few cases fall more than three 
standard deviations from the mean if the distribution is normal. Two-thirds 
of the cases fall within one s.d. of the mean. If we know the mean and
standard deviation, and can assume a roughly normal distribution, we can
reconstruct the distribution curve from these facts.
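
The proportions just mentioned can be checked numerically. The sketch below uses the `NormalDist` class from Python's standard `statistics` module; the helper name `fraction_within` is an assumption made for illustration.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def fraction_within(k: float) -> float:
    """Fraction of a normal distribution lying within k s.d. of the mean."""
    return STANDARD_NORMAL.cdf(k) - STANDARD_NORMAL.cdf(-k)


for k in (1, 2, 3):
    print(k, round(100 * fraction_within(k), 1))
```

Roughly 68 percent of cases fall within one s.d. (the "two-thirds" of the text), and fewer than three cases in a thousand fall beyond three s.d.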

17. a. In a normal distribution, what is the relation of the mean and median? 

b. What percentile rank corresponds to a score 2 s.d. above the mean? To a 
score 1 s.d. below the mean? 

18. a. Compute the mean and standard deviation for the scores given in
Problem 15, p. 28.

b. How does the mean compare with the median computed previously? 

c. What is the approximate percentile rank for a score 2 s.d. above the mean 
in this distribution? 

19. Assuming that scores are normally distributed on a test where the mean is 
60 and s.d. is 8, interpret the following scores: Sara, 64; Harriet, 68; Charles, 
87; Bob, 48. 


Standard Scores 

One method Mr. Gates can use for recording results is to convert raw 
scores into standard scores. The standard score for a case is his deviation 
from the mean, expressed in terms of the standard deviation. On the EGY 
test, a score of 30 is just 1 s.d. above the mean; +1.0 is the z-score
corresponding to a raw score of 30. A z-score of −2.5 shows that a person is 2½
s.d. below the mean. The computation of z-scores is illustrated in Computing 
Guide 3. The merit of the standard score is that if we know how many stand- 
ard deviations from the mean a person falls on a given test, we can make a 
fairly accurate interpretation of his position in a group. In this sense, stand- 
ard scores serve like percentile scores, but they have the additional advan- 
tage that they may be averaged and correlated without error. 

Another type of standard score, the T-score, adjusts scores so that their 
mean is 50 and their standard deviation is 10. In T-scores, 60 represents a 
score 1 s.d. above the mean, and 25 is 2½ s.d. below the mean. Computation
of T-scores is also illustrated in Computing Guide 3. Other conversion for- 
mulas are also possible. The Army converted scores on its tests so that the 
mean standard score would be 100 and the standard deviation would be 20. 
Wechsler converts subtest scores on the Bellevue Intelligence Scale to stand- 
ard scores with a mean of 10 and a s.d. of 3. 
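
All of these conversions are the same linear rescaling of the z-score, as the following Python sketch suggests. The mean and s.d. are the EGY values given in the text; the function names `z_score` and `rescale` are assumptions introduced here.

```python
def z_score(raw: float, mean: float, sd: float) -> float:
    """Deviation from the mean expressed in standard-deviation units."""
    return (raw - mean) / sd


def rescale(raw: float, mean: float, sd: float,
            new_mean: float, new_sd: float) -> float:
    """Standard score on a scale with the chosen mean and s.d."""
    return new_mean + new_sd * z_score(raw, mean, sd)


EGY_MEAN, EGY_SD = 26.2, 3.8  # values reported in the text for the EGY test

print(round(z_score(30, EGY_MEAN, EGY_SD), 1))        # 1.0
print(round(rescale(20, EGY_MEAN, EGY_SD, 50, 10)))   # T-score: 34
print(round(rescale(28, EGY_MEAN, EGY_SD, 50, 10)))   # T-score: 55
print(round(rescale(30, EGY_MEAN, EGY_SD, 100, 20)))  # Army-style scale: 120
print(round(rescale(30, EGY_MEAN, EGY_SD, 10, 3)))    # Wechsler-style scale: 13
```

Because every such scale is a linear function of the raw score, averages and correlations computed from them agree with those computed from raw scores.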

The student is warned of a possible confusion between standard scores
computed by the method of Computing Guide 3 and “normalized scores,”
which also are identified by the names z-score and T-score. The normalized
scores are computed by a different and more complex method, which will
not be important to us in this book.

Computing Guide 3. Computation of z-scores and T-scores.

Begin with the distribution of raw scores to be studied and compute the
mean and standard deviation by the procedure of Computing Guide 2.

   For the EGY scores, M = 26.2 and s = 3.8.

To obtain z-scores, the mean is set equal to a z-score of zero. The z-score
equivalent of other scores is obtained by the formula:

   $z\text{-score} = \frac{\text{raw score} - \text{mean}}{\text{standard deviation}}$

   Raw score 26.2 equals z-score 0.

   To get z-score equivalent of 20:
   $z\text{-score} = \frac{20 - 26.2}{3.8} = \frac{-6.2}{3.8} = -1.6$
   Raw score 20 equals z-score −1.6.

   To get z-score equivalent of 28:
   $z\text{-score} = \frac{28 - 26.2}{3.8} = \frac{1.8}{3.8} = 0.5$
   Raw score 28 equals z-score 0.5.

If T-scores are desired instead of z-scores, the mean is set equal to 50.
T-scores are computed by the formula:

   $T\text{-score} = 50 + \frac{10(\text{raw score} - \text{mean})}{\text{standard deviation}}$

   Raw score 26.2 equals T-score 50.

   To get T-score equivalent of 20:
   $T\text{-score} = 50 + \frac{10(-6.2)}{3.8} = 50 - 16 = 34$
   Raw score 20 equals T-score 34.

   To get T-score equivalent of 28:
   $T\text{-score} = 50 + \frac{10(1.8)}{3.8} = 50 + 5 = 55$
   Raw score 28 equals T-score 55.

We can now review the methods available to Mr. Gates. If he wishes to 
record and report the performance of prisoner C. M., he can merely note 
the raw score of 24. He could make this more meaningful by reporting at 
the same time that the median is approximately 27. Or he could report it 
together with the mean (26.2) and the s.d. (3.8). If he wishes to convert 
to percentiles, he can report C. M.’s percentile score of 23, which is self- 
explanatory. If he prefers standard scores, he will record the z-score of —0.6 
or the T-score of 44, either of which is readily interpreted.

20. With reference to Figure 5, interpret each of the following: a z-score of 3.0; 
a z-score of —2.0; a T-score of 40; a T-score of 65. 

21. For the 1937 Stanford-Binet test, the mean IQ is 100 and the standard
deviation is 16. Express in T-score form the following IQ’s: 100, 84, 132, 150, 70,
105.


Comparison of Percentiles and Standard Scores 

Both percentiles and standard scores are widely used, and under some 
conditions one may be translated into the other. The percentile score has 
these advantages: It is readily understood, which makes it especially satis- 
factory for reporting data to persons without statistical training; it is easily 
computed; it may be interpreted exactly, even when the distribution of test 
scores is skewed or bimodal. The disadvantages of percentile scores are 
these: They magnify small differences in score near the mean which may
not be significant, and they reduce the apparent size of large differences in 
score near the tails of the distributions; they may not be used in many statis- 
tical computations. The advantages of standard scores are as follows: Differ- 
ences in standard score are proportional to differences in raw score; use of 
standard scores in averages and correlations gives the same result that would 
come from use of the raw scores. The disadvantages are that standard scores 
cannot be interpreted readily when distributions are skewed, and that un- 
trained persons are generally unable to understand them. Psychologists are 
increasingly urging that standard scores be used wherever possible. Testing
experts believe that percentiles are too often misleading, and that if stand- 
ard scores were consistently used non-statisticians would quickly learn to 
interpret them (11).

When the variable tested is normally distributed, standard scores may be 
converted into percentile scores and vice versa by means of the table of the 
normal curve (Table 1). This table is widely used in statistical work. It is 
based on the percentage of cases falling within each section of the normal 
curve, as shown in Figure 5. The table is symmetrical; a score 2 s.d. from 
the mean in either direction cuts off the same proportion of cases. The
relationship between the standard deviation and the percentage distribution of
cases is valuable in the solution of many statistical problems. Whenever a
variable is normally distributed, one may use Table 1 to reconstruct the
distribution.

Table 1. Relations Between Standard Scores and Percentile Scores, When Raw
Scores Are Normally Distributed

Distance                          Percent                          Distance
from Mean                         of Cases                         from Mean
in s.d.              Percentile   in "Tail"   Percentile           in s.d.
(z-Score)   T-Score  Score        of Curve    Score       T-Score  (z-Score)

  3.0         80      99.9          0.1        0.1          20      -3.0
  2.9         79      99.8          0.2        0.2          21      -2.9
  2.8         78      99.7          0.3        0.3          22      -2.8
  2.7         77      99.6          0.4        0.4          23      -2.7
  2.6         76      99.5          0.5        0.5          24      -2.6
  2.5         75      99.4          0.6        0.6          25      -2.5
  2.4         74      99.2          0.8        0.8          26      -2.4
  2.3         73      99            1          1            27      -2.3
  2.2         72      99            1          1            28      -2.2
  2.1         71      98            2          2            29      -2.1
  2.0         70      98            2          2            30      -2.0
  1.9         69      97            3          3            31      -1.9
  1.8         68      96            4          4            32      -1.8
  1.7         67      96            4          4            33      -1.7
  1.6         66      95            5          5            34      -1.6
  1.5         65      93            7          7            35      -1.5
  1.4         64      92            8          8            36      -1.4
  1.3         63      90            10         10           37      -1.3
  1.2         62      88            12         12           38      -1.2
  1.1         61      86            14         14           39      -1.1
  1.0         60      84            16         16           40      -1.0
  0.9         59      82            18         18           41      -0.9
  0.8         58      79            21         21           42      -0.8
  0.7         57      76            24         24           43      -0.7
  0.6         56      73            27         27           44      -0.6
  0.5         55      69            31         31           45      -0.5
  0.4         54      66            34         34           46      -0.4
  0.3         53      62            38         38           47      -0.3
  0.2         52      58            42         42           48      -0.2
  0.1         51      54            46         46           49      -0.1
  0.0         50      50            50         50           50       0.0
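
For readers working by machine rather than from the printed table, the same conversions can be sketched with the `NormalDist` class of Python's standard `statistics` module. The function names are assumptions, and the conversion is meaningful only when the scores are in fact normally distributed.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def z_to_percentile(z: float) -> float:
    """Percent of cases falling below a given z-score in a normal distribution."""
    return 100.0 * STANDARD_NORMAL.cdf(z)


def percentile_to_z(percentile: float) -> float:
    """z-score that cuts off the given percentage of cases below it."""
    if not 0.0 < percentile < 100.0:
        raise ValueError("percentile must lie strictly between 0 and 100")
    return STANDARD_NORMAL.inv_cdf(percentile / 100.0)


def z_to_t(z: float) -> float:
    """T-score (mean 50, s.d. 10) corresponding to a z-score."""
    return 50.0 + 10.0 * z


for z in (2.0, 1.0, 0.5, 0.0, -1.0, -2.0):
    print(f"z = {z:+.1f}  T = {z_to_t(z):.0f}  percentile = {z_to_percentile(z):.0f}")
```

The symmetry of the table appears directly: the percentile for a negative z-score is 100 minus the percentile for the matching positive z-score.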

One of the most valuable methods for presenting test scores so that
different tests may be considered simultaneously is the profile. A profile is a graph
showing the subject’s performance. In order to compare him on several tests, 
all scores are expressed as standard scores or percentile scores, using the 
same norm group. Illustrative profiles are found in Figures 30 and 39 in later 
chapters. 

22. If a teacher wishes to keep records of class examinations in derived-score form, 
so that he can tell at a glance how well a person is doing and can average all 
tests equally in the final grade, should he use percentiles or standard scores? 
23. In a certain college, all freshmen are given a “college aptitude test.” The
results are to be mimeographed and confidential copies given to all student
advisers. Should the report use raw scores, standard scores, or percentiles?

24. The psychometrist giving tests for a Veteran’s Counseling Center is asked to
give a wide variety of tests to veterans needing counseling. After each man
has taken from four to eight tests, results are to be placed on a standard
report form so that performance on all tests can be compared by the counselor,
in conference with the veteran. What problems will be encountered if all
results are reported in percentiles? What problems will be encountered if all
results are reported as standard scores? Can you suggest a suitable plan for
report?

25. The American Air Force used a derived score called the “stanine” (stay-nine) 
in which 1 represented the lowest 4 percent; 2, the next 7 percent; 3, 12
percent; 4, 17 percent; 5, the middle 20 percent; and 6, 7, 8, 9 included
corresponding percentages at the higher end of the distribution. Describe this
system in terms of T-scores, using Table 1.

CORRELATION 

Meaning of Correlation 

Nearly all research on testing involves correlation. Correlation is a method 
of summarizing the relationship between two sets of data. It is the most com- 
mon method for reporting the answer to such questions as the following: 
Does this test predict performance on the job? Do the diagnoses made from 
this test agree with diagnoses established by the case histories? Do scores 
made on this test a year ago agree with the scores the same people make 
now? 

Correlation (r) may take any value from −1.00 to +1.00. If the standing
of individuals in one test corresponds exactly to their standing in another,
the correlation is +1.00. A correlation of 1.00 means that we could predict
perfectly one test from the other. Psychological tests are rarely so accurate.
If two variables are completely unrelated, so that one cannot be used to
predict the other, the correlation is .00. Near zero correlation might be
expected between speed of tapping and reasoning ability, or between ability
to match colors and skill in archery. Where the degree of relationship is 
positive but imperfect, so that some prediction can be made, the correlation 
is expressed by some such number as .24, .63, or .87, depending on the ex- 
tent of correspondence. Negative correlations are found when the relation 
between two variables is inverse, the person standing high on one test hav- 
ing a low standing on the other. A correlation of —1.00 is found between 
Number Right and Number Wrong on the same test, if everyone attempts 
every item. Negative relationships of intermediate size are sometimes found 
between aptitude and time required to do a job, or between somewhat op- 
posite personality traits such as self-confidence and submissiveness. 
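
The two perfect correlations mentioned in this chapter (Number Right with Number Wrong, and the side of a square with its diagonal) can be verified with a small Python sketch of the product-moment coefficient. The data are hypothetical, and `pearson_r` is a name chosen here for illustration.

```python
from math import sqrt
from statistics import fmean


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two cases")
    mean_x, mean_y = fmean(x), fmean(y)
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / sqrt(sum_xx * sum_yy)


# Hypothetical 20-item test on which everyone attempts every item.
number_right = [12, 15, 9, 18, 20, 14]
number_wrong = [20 - right for right in number_right]
print(round(pearson_r(number_right, number_wrong), 2))

# Side and diagonal of several squares.
sides = [1.0, 2.0, 3.0, 4.5]
diagonals = [side * sqrt(2) for side in sides]
print(round(pearson_r(sides, diagonals), 2))
```

Both cases are exact linear relations, one falling and one rising, which is why the coefficients reach the limits of −1.00 and +1.00.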

Table 2 illustrates situations in which each degree of relationship is
encountered. Correlations over .80 are uncommon. Correlations below about
.60 are frequently useful for establishing that a relationship exists, but when
the relation is so imperfect one variable does not predict the other very
exactly.


Table 2. Variables Yielding Correlations of Various Sizes

r       First Variable                          Second Variable                       Reference

1.00    Length of side of square                Diagonal of square
 .93ᵃ   IQ on Stanford-Binet, Form L            IQ on Form M a few days later;        10, p. 47
                                                  children of uniform age
 .88    IQ on Stanford-Binet, 1916 form         IQ on same test about 15 months       1
                                                  later, for problem children
 .80    Raw score on Otis Self-Administering    Age, for persons 7 to 17              7
          Test of Mental Ability
 .75    Interest scores of college seniors      Interest scores five years later      9
 .70    IQ on Henmon-Nelson group test for      Same, two years later                 6
          high-school seniors
 .65ᵃ   Score on a well-prepared one-hour       Score on a similar test given         8, p. 472
          classroom examination                   shortly after
 .50    Score on essay examination in           Score on objective test of same       4, pp. 92-99
          English literature                      material
 .45ᵃ   Intelligence test score                 College marks                         8, p. 848
 .25ᵃ   Score on test of moral knowledge        Measures of actual conduct            8, p. 168
                                                  showing character
 .20ᵃ   IQ                                      School marks in foreign language      3
 .15ᵃ   Intelligence                            Composite index of bodily size        8, p. 183
-.20    Score on Linfert-Hierholzer infant      IQ on Stanford-Binet at 4 years       2
          tests at one year
-.50    Score on Otis Self-Administering        Age, for adults from 20 to 92ᵇ        7
          Test of Mental Ability

ᵃ Approximate; based on several correlations.

ᵇ There is little or no decline of ability in early adulthood, marked decline in later years.


26. What correlation would you anticipate between the following pairs of
variables?

a. Age and annual income of men aged 20 to 50. 

b. Age in January, 1930, and age in March, 1950. 

c. Scores on two intelligence tests, given the same week. 

d. Annual income and number of children, among married urban men. 

e. Maximum and minimum temperature in Wichita, each day for a year. 

If two scores are available for each person in a group, a scatter diagram 
may be made to show their relationship. One axis for the diagram represents 
one score, and one represents the other. A point is plotted to show the scores 
of each person. A series of scatter diagrams representing different degrees of
relationship is shown in Figure 6. If the two scores are highly correlated, the
points fall close to a straight line. If the two
scores have no relationship, a case having a given score on the first test may
fall anywhere along the scale for the other test, and the scatter distribution
will be nearly round.

A high correlation is often erroneously interpreted as showing that one
factor causes the other. There are at least three possible explanations for a
high correlation between two variables, A and B. A may cause or influence
the size of B, B may cause A, or both A and B may be influenced by some
common factor or factors. The correlation between score on a vocabulary
test and score on a reading test may be taken as an example. Does good
vocabulary cause one to be a good reader? Possibly. Or does ability to read
well cause one to acquire a good vocabulary? An equally likely explanation.
But the results could also be explained as resulting from high intelligence, a
home in which books and serious conversation abounded, or superior
teaching in the elementary schools. Only a theoretical understanding of the
processes involved, or controlled experiments, permit us to state what causes
underlie a particular correlation. Otherwise, the only safe conclusion is that
correlated measures are influenced by a common factor.

If the correlation is less than perfect, one measure, or both, is influenced 
by some factor not common between the measures. Random errors of meas- 
urement, which occur independently in the two measures, lower correlation. 
So do causal factors not measured in both variables. The correlation between 
intelligence and school marks is only moderate because many factors besides 
mental ability influence the marks one receives. These include pupil effort, 
teacher bias, previous school learning, health, and so on.

27. What possible causal relations might underlie each of the following correla- 
tions? 

a. Between amount of education and annual income of adults (assume that r 
is positive) . 

b. Between average intelligence of children and size of family (assume r neg- 
ative). 

c. Between Sunday-school attendance and honesty of behavior (assume r 
positive). 

28. Account for the correlations in Table 2 in terms of possible causes. What com- 
mon elements might lead to the relations found? What factors not common to 
both tests might reduce these correlations from 1.00? 

Figure 7 indicates the relation between the size of a correlation coefficient
and the accuracy of estimation of one variable from the other. If the correla- 
tion of a test with another variable is zero, prediction by means of the test is 
no more accurate than an arbitrary guess that every pupil will fall at the 
mean on the second variable (8, p. 594). Such a guess will be very wrong 
for most cases. As the correlation rises, estimation becomes more accurate, 
but errors are still substantial even when r is .90. A correlation of .50 permits 
predictions only slightly better than guessing. 
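
The statements above can be made concrete with the index of forecasting efficiency, commonly defined as E = 1 − √(1 − r²), the proportional reduction in the error of estimate compared with guessing the mean. The sketch below assumes that Figure 7 plots this standard index; the function name is invented for illustration.

```python
from math import sqrt


def forecasting_efficiency(r: float) -> float:
    """Proportional reduction in error of estimate for a correlation r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and +1")
    return 1.0 - sqrt(1.0 - r * r)


for r in (0.0, 0.5, 0.8, 0.9, 1.0):
    gain = 100 * forecasting_efficiency(r)
    print(f"r = {r:.2f}  reduction in error = {gain:.0f} percent")
```

On this index a correlation of .50 removes only about 13 percent of the error of a blind guess, and even .90 removes little more than half.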

Computation of Correlation 

The product-moment correlation procedure outlined in Computing Guide
4 is closely related to the procedure for computing means and standard
deviations. Points are tallied on the scatter diagram, and both means, both
s.d.’s, and the correlation are computed. Variations in the procedure and
checks are discussed in most statistics texts. Numerous forms have been
published which reduce the labor of calculation. The product-moment method
is generally employed for testing relationships, although other methods are
more suitable for special problems, such as when regression lines are curved.

Fig. 7. Index of forecasting efficiency, showing improvement of estimation as the
correlation coefficient increases.
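
A minimal Python sketch of the arbitrary-origin computation follows. It works from a list of score pairs rather than a tallied scatter diagram, the pairs are hypothetical (only the first, 24 and 35, echoes the guide), and the function name is an assumption. The last line repeats the arithmetic with the figures legible in Computing Guide 4.

```python
from math import sqrt


def r_from_origins(pairs, origin_x, origin_y):
    """Product-moment r computed from deviations about arbitrary origins."""
    n = len(pairs)
    dx = [x - origin_x for x, _ in pairs]
    dy = [y - origin_y for _, y in pairs]
    cx, cy = sum(dx) / n, sum(dy) / n                 # corrections
    sx = sqrt(sum(d * d for d in dx) / n - cx ** 2)   # s.d. of X
    sy = sqrt(sum(d * d for d in dy) / n - cy ** 2)   # s.d. of Y
    mean_product = sum(a * b for a, b in zip(dx, dy)) / n
    return (mean_product - cx * cy) / (sx * sy)


# Hypothetical pairs of scores (X, Y).
pairs = [(24, 35), (26, 37), (30, 41), (22, 36),
         (28, 38), (27, 39), (25, 34), (29, 40)]

# The choice of origin does not affect the result.
print(round(r_from_origins(pairs, 26, 37), 3))
print(round(r_from_origins(pairs, 0, 0), 3))

# Arithmetic with the figures legible in Computing Guide 4.
print(round((10.51 - 0.06) / (3.79 * 3.68), 2))  # 0.75
```

As with the mean and standard deviation, the arbitrary origins serve only to keep hand computation manageable; they cancel out of the final coefficient.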

SAMPLING ERRORS 

Although a full treatment of the topic requires more space than is avail- 
able, attention must be drawn to the problem of sampling errors. Any score, 
and any average, percentile, or correlation, is an estimate. The score on a 
test would be different if we had tested the person at another moment. The 
average would be different if one or two different persons had been included 
in the scores averaged. Any statistic is only an approximation, since many 
factors determine the particular value obtained. This is especially true of 
norms, for when one attempts to state the average finger dexterity of typical 
ninth-graders, the attitude of typical Americans about tax reduction, or the
normal percentage of maladjusted among unmarried women, what figure he 
reports depends upon which persons happen to constitute his sample. The 
reader of statistics must not take any number at its face value. It is
necessary to realize that a different value (not much different, it is hoped) would
have been obtained with other cases or if the test had been given at a differ- 
ent time. The experimenter usually guards against serious sampling errors 
by studying typical persons and by using a large sample. An average based 
on ten persons would change appreciably if only one member of the group 
were changed; an average based on five hundred would scarcely show the
difference. Formulas for estimating the exact magnitude of sampling errors 
in correlations, means, percentiles, and other statistics are discussed in sta- 
tistics texts. 
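
The effect of sample size on the stability of an average can be illustrated with a short Python sketch. The scores are hypothetical, and the function name is an assumption; the point is only that replacing one member moves the mean by the size of the change divided by the number of cases.

```python
from statistics import fmean


def shift_in_mean(sample, replacement):
    """Change in the mean when the first member of the sample is replaced."""
    changed = [replacement] + list(sample[1:])
    return fmean(changed) - fmean(sample)


# Hypothetical groups in which every member scores 26.
small_group = [26.0] * 10
large_group = [26.0] * 500

# Replace one member by a person scoring 36 in each group.
print(round(shift_in_mean(small_group, 36.0), 2))  # 1.0
print(round(shift_in_mean(large_group, 36.0), 2))  # 0.02
```

The same ten-point change shifts the small-group mean by a full point but the large-group mean by only two hundredths of a point.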

Computing Guide 4. Computing the Product-Moment Correlation Coefficient

[...] diagram, entering one tally for each pair of scores. (The first pair
[24-35] is tabulated in the cell above 24 on the X scale, and opposite 35 on
the Y scale. This cell is outlined in the illustration.)

3. Count the number of tallies in each column, and write it below the
   diagram in a row labeled $f_x$. Count the number in each row, and write it
   beside the diagram in a column labeled $f_y$.

4. Select an arbitrary origin for X and for Y, and determine the mean and
   standard deviation for each as in C. G. 2 (computation not shown).

   $\text{A.O.}_x = 26.0$      $\text{A.O.}_y = 37.0$
   $c_x = .20$                 $c_y = .29$
   $M_x = 26.2$                $M_y = 37.3$
   $s_x = 3.79$                $s_y = 3.68$

5. In each cell of the scatter diagram, multiply the number of tallies by
   the value of $d_x$ written below that column, and write the product in
   the cell. (In the outlined cell, for instance, there are two tallies, and
   $d_x$ is −2; the product is −4.)

6. In each row, add the numbers written in the cells, and place in a
   column labeled $f d_x$.

7. Multiply each entry in this column by $d_y$ and enter in a column labeled
   $f d_x d_y$.

8. Add the column $f d_x d_y$.

9. Substitute the numbers in the following formula:

   $r_{xy} = \frac{\frac{\sum f d_x d_y}{N} - c_x c_y}{s_x s_y}$

   $r_{xy} = \frac{10.51 - .06}{3.79 \cdot 3.68} = \frac{10.45}{13.94}$

   $r_{xy} = .75$

CHAPTER 4


How to Choose Tests


WHAT IS A GOOD TEST?

When a teacher investigates the intelligence of her pupils, she wishes to use 
the best intelligence test available. An industrial psychologist hoping to se- 
lect superior workers for a factory wishes to try the best possible test of 
intelligence. The clinical psychologist, studying a child who may be feeble- 
minded, wishes to use the intelligence test which will give the fairest results. 
It is not surprising, then, that the test user repeatedly asks, “What is the best 
test in this field?” 

The purchaser of tests certainly has a confusing problem. He is faced with 
a choice among long tests and short tests, famous tests and unfamiliar tests, 
old tests and new tests, ordinary tests and novel tests. The catalogue of a
leading test distributor offers 35 tests of general mental ability, and 18 tests
of personality. Each of these tests was published by a person who thinks his
test is in some way superior to the others on the market. He is frequently
correct. Different tests have different virtues; there is no one test of any sort
which is “the best” for all purposes. The teacher, the industrial psychologist,
and the clinical psychologist need good tests of mental ability. But the test
which serves one most usefully is probably not the best for either of the
others.

Tests must always be selected for the particular purpose for which they 
are to be used. Some tests work well with children but not with adults; 
some give precise measures but require more time than can be allowed for 
testing; some give satisfactory general estimates of subjects but less detailed 
diagnosis of each individual than another test will. No test maker can put 
into his test all desirable qualities; a change in design improves the test in 
one respect only by sacrificing some other characteristic. 

Not all published tests are good tests, however. Some have been published 
without adequate research and refinement. Some, even those having wide 
popularity, do not succeed in measuring what they were intended to meas- 
ure. Some measure characteristics different from what their titles indicate. 
Test publication is a commercial activity, and, although most test authors 
and publishers maintain high standards, the description of a test furnished 
by the author usually advertises its favorable features. Since some published 
tests are nearly worthless, and others found useful for one task will not per- 
form well in another situation, the user must be able to choose among tests 
intelligently. No list of “recommended tests” can eliminate the necessity for
careful choice for the individual situation. Even in similar situations the
same tests may not be appropriate. Different schools find different tests best. 
Tests which select supervisors well in one plant prove valueless in another. 
And clinicians choose different tests for each patient. 

Any list of superior tests soon becomes outdated, so that it is necessary for 
the user of tests constantly to evaluate new developments. New tests are
added, new uses of tests are discovered, and new findings about old tests 
are brought to light. Louttit and Browne found that of the nine psychologi- 
cal tests most used in clinics in 1935 only two remained when a second sur- 
vey was made in 1946. Only two of the newcomers to the top nine were 
published later than 1935; the other five were available in 1935, but their use- 
fulness had not been recognized (36). Buros, who established a “yearbook” 
of test reviews, comments as follows (9, p. 11):

“Instead of limiting the volume to reviews of new and recent tests . . . the de- 
cision was made to include old as well as new tests. Reviews of old tests may prove 
effective in eliminating from use many tests which were among the best in their 
day but are now outmoded and inferior to recently constructed tests. On the other 
hand, such reviews may result in increasing the use of old tests and testing tech- 
niques which compare very favorably with tests being currently published. The 
sale of outmoded and ever-decreasingly valid tests persists far beyond the sale of
textbooks published in the same years. Despite the fact that tests have been on the 
market for five and fifteen years, there exists for probably ninety percent of the 
tests, a dearth of critical information concerning their reliability and validity. Ap- 
praisals and reappraisals of old tests are needed almost as badly as evaluations of 
new tests.” 

1. “Improving a test in one way weakens it in another.” What advantage, and
what disadvantage, comes from each of the following changes? 

a. Lengthening a test. 

b. Making it more interesting to children. 

c. Making it more diagnostic of strong and weak points. 

d. Giving it as an individual instead of as a group test. 

2. This is a letter received by a psychologist from an industrial personnel manager
hiring office and factory workers. How would you answer it on the basis of 
the paragraphs above, knowing that the tests mentioned are representative of 
their type? 

“. . . Just now we are planning the use of the following tests: Otis
intelligence, Humm-Wadsworth Temperament Scale, and aptitude tests related to
our openings, such as the MacQuarrie mechanical aptitude test. Does this
seem to be a well-balanced testing schedule for industry? Are there tests that
you think preferable to these?”

PRACTICAL CONSIDERATIONS IN CHOOSING A TEST 

The most important factor in the suitability of a test is whether it does
the job it is supposed to do. But in school, industrial, and clinical situations, 
the tester must bow to numerous practical considerations which limit the 
tests he can use. Of all the tests available, he must cross off those which are 
impractical before making a careful investigation of the validity of the
remainder. Practical considerations never justify using a test which gives
worthless information, but a technically sound test cannot serve where it
is impractical. The most significant practical considerations are cost, time
required, ease of administration and scoring, comparable forms, face validity
and interest, user acceptability, and usefulness of results.

The cost of the usual test is only a few cents, but when one is testing a
number of persons, a difference in cost may be important. Fortunately
there is little relation between the cost of tests and their quality, so that
even a limited budget permits the use of well-constructed tests. The cost
is greatly reduced where it is possible to use an answer sheet and a
reusable question booklet. Most tests intended for use with large groups are
now published in this form. With young children or persons having little test
experience, however, a separate answer sheet may prove confusing. In
determining the cost of a test, one must consider not only the cost of the
materials but also the cost of scoring. Cost must not be overweighted in
comparing tests. Cost of tests is trivial compared to the cost which results
from a wrong decision: training a worker who will not succeed on the job,
or giving a semester’s college education to a boy who will fail, or
hospitalizing a soldier who becomes neurotic under the strain of combat. Even
several dollars spent to reduce such errors are well spent.

3. Scates, in a careful cost accounting, found that purchasing, giving, scoring, and 
analyzing statistically the Stanford Achievement Test cost 56 cents per pupil at 
1930 prices (40). This is a battery of ten group tests, requiring two and one- 
fourth hours of testing. Would costs be expected to run higher or lower in test- 
ing applicants for industrial jobs? 

Time requirements of a test are frequently significant. Where only a
limited time is available to obtain information, it may be necessary to use a 
short test rather than a longer one which would be more accurate. Sometimes
several short tests give a more complete picture of the person than a
single long one. The length of a test has an important effect on the
cooperativeness of the subject. If a test begins to bore him, he is likely not
to show his best ability and may develop antagonism toward the organization
testing him. With subjects having high morale, on the other hand, even a very
long test can be used.

Ease of administration must be considered, since some tests require the
services of expertly trained testers and scorers. Unless such
services are available, it will be impossible to use the tests validly. Even
where skilled persons are available, it is sometimes wise to use simpler tests
which others can give, conserving the time of the expert for other duties. In
schools where classroom teachers can give the tests themselves, they often
take a greater interest in the results than when “outsiders” must be called in.
On the other hand, a testing program which requires that teachers score




ESSENTIALS OF PSYCHOLOGICAL TESTING 



Fig. 8. Portion of answer sheet for machine scoring. (Courtesy International 
Business Machines Corporation.) 


large numbers of papers becomes unpopular; tests that can be scored by
clerks are more readily accepted.

Scoring is facilitated by efficient scoring stencils. The most useful design
requires the student to mark his answer by checking or blackening one of
several spaces; a punched-out key showing only correct answers is then
superimposed for scoring. In large testing programs, the International
Business Machines test-scoring machine is economical. It uses an answer
sheet with spaces to be blackened to indicate correct answers. Electrical
“fingers” in the machine sense where pencil marks appear, since the
graphite in the marks carries electrical current. The total number of
marks in the proper places is shown on a meter. The machine will report
number of errors, rights-minus-wrongs, and other types of scores. With it one
can score about 500 papers per hour. Military classification centers relied
on the machines to speed processing of recruits. Large school systems operate
such machines for scoring of tests for the entire system. In most sections of
the country, a test-scoring service is available where tests may be
machine-scored for a moderate fee.

Fig. 9. IBM test-scoring machine. (Photo, courtesy International Business
Machines Corporation.)
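
The kinds of score mentioned above (number right, number of errors, rights-minus-wrongs) are easily expressed in a few lines. This is a rough sketch under stated assumptions, not a description of how the machine works: the key and answer sheet are invented, the function name `score_sheet` is hypothetical, each item is assumed to carry at most one mark, and an omitted item (shown as `None`) is assumed to count neither as right nor as wrong.

```python
def score_sheet(key, responses):
    """Score one answer sheet against a key; None marks an omitted item."""
    rights = sum(1 for k, r in zip(key, responses) if r == k)
    wrongs = sum(1 for k, r in zip(key, responses)
                 if r is not None and r != k)
    return {
        "rights": rights,
        "wrongs": wrongs,
        "omitted": len(key) - rights - wrongs,
        "rights_minus_wrongs": rights - wrongs,
    }

# Invented six-item key and one subject's marks.
key = ["B", "D", "A", "C", "A", "E"]
sheet = ["B", "D", "C", None, "A", "B"]

scores = score_sheet(key, sheet)
print(scores)
```

Whether omissions should be penalized is a scoring decision of the test author; the sketch simply reports them separately.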
Comparable forms are helpful when tests are used for research or for
measuring the effect of teaching or therapy. It is desirable to use equivalent
tests before and after, rather than repeating the same test, in order to rule
out the effect of memory. An equivalent form is also useful to confirm a test
score which is possibly inaccurate owing to the emotional disturbance of the
subject or some other unfavorable condition.

Face validity and interest are important in obtaining cooperation from 
the person tested. If the applicant for a job is given a test which seems to
him silly or unrelated to the job, he is likely to be resentful. If he is not hired, 
he may excuse his failure by blaming the tests as unfair; such a report dam- 
ages public relations and makes it harder to obtain job applicants. Some 
satisfactory workers have had little schooling and are distrustful of tests
which probe their weaknesses. Catch questions and questions which seem 
childish are especially likely to arouse criticism. If a test is interesting, 
taking it is likely to be a pleasant experience. This not only tends to make 
the scores valid, but also helps establish good relations between the per- 
sonnel worker and the subject. 

The acceptability of a test to those who will use the results is equally
important. Although fully trained psychologists will examine a test on a tech-
nical basis, other users of test results have strong prejudices. If a group of 
social workers is accustomed to and likes results from mental test A, when 
the psychologist decides to substitute mental test B he will encounter 
difficulty. Even if test B is more accurate than A, the social worker may 
disregard the results from B because it does not have his confidence. So 
important is user acceptability in working with teachers, industrial personnel 
men, physicians, and others that the psychologist must often use a test 
which would be his second or third choice. If he is certain that he should 
wean others from their initial preference, he must undertake a deliberate 
campaign in favor of the test he advocates. 

One further practical consideration is the usefulness of results from a
test. If results are couched in terms which require an expert interpretation,
only the expert can use the scores, whereas another test may provide facts 
in a form useful to everyone working with the subject. Some tests give very 
limited information, which can be used only once, but a slightly more elab- 
orate test will provide data to answer numerous questions in the future. 

4. Decide which of the practical considerations mentioned above would be very
important, which important, and which unimportant, in each of the following
situations.

a. The Army desires to test every man inducted during wartime to determine
which ones may be feeble-minded.

b. A college wishes to test every freshman during registration to determine
which ones should enter a course in remedial reading.

c. A court wishes a mental test to determine whether a prisoner is mentally
defective.

d. The Gallup poll wishes to determine the mental level of a cross section of the
voting public while interviewing them (47).
TECHNICAL CRITERIA OF A GOOD TEST
Validity 

The one indispensable characteristic of a test is validity. A test is valid 
to the degree that we know what it measures or predicts. There are two basic
approaches to validity: logical analysis and empirical analysis. In logical 
analysis, one attempts to judge precisely what the test measures. In empirical 
analysis, one attempts to show that the test is correlated with some other 
variable, and therefore measures the same thing. Judgment leads to a 
psychological characterization of the test; e.g., “it measures speed of re- 
petitive finger movement,” or “it measures ability to visualize three-dimen- 
sional forms.” An empirical analysis usually relates the test to a practical 
purpose: “The test measures aptitude for pharmacy” (i.e., it correlates with
success in pharmacy). 

Logical Validity

Every test is to some degree impure, and very rarely does it measure
exactly what its name implies. Before the tester can understand test results,
he must know whether his “intelligence” test is strongly influenced by how
well the subject reads, whether his “scientific aptitude” test is heavily
influenced by the subject’s training in mathematics, and whether his test of
“open-mindedness” is influenced by the subject’s desire to make a good
impression. One way to judge validity is to make a careful study of the test
itself, to determine what test scores mean.

Logical analysis of a test aims at psychological understanding of the
processes that affect scores. It does not stress validity for any particular
purpose. If the user knows just what the test score tells about the subject,
he can decide for which purposes it is appropriate. For this reason, the
definition above has been given rather than the conventional “Validity is the
extent to which a test measures what it purports to measure.” Only empirical
validity is concerned with the usefulness of a test for some single purpose.

Validity may be established deductively by showing that a test corresponds
to the definition of the trait intended to be measured; or it may be
established inductively by naming the traits represented in the items at
hand. If we have a clear definition of a trait the test is supposed to measure,
we can examine the items to see if they conform to that definition. For
example, a test may attempt to measure knowledge of vocabulary. Before we
can analyze it, we must know what is meant by “knowledge” and
“vocabulary.” If the tester then informs us that by “knowledge” he means “ability
to give a definition,” and that “vocabulary” refers to “words commonly
used in seventh-grade textbooks,” we can judge the test. If the test asks the
student to give definitions of many words, and we find those words to be
representative of seventh-grade textbooks, the definition describes the test.
But if the words are largely from science or are more difficult than those
in seventh-grade texts, or if the items ask the student to match words with
printed definitions, the score measures something other than what the
definition implies.

Even when items relate to a definition, they may bring in irrelevant vari- 
ables which make the test impure. The following five criticisms illustrate 
the types of weakness a careful analysis discloses. 

1. The Relative Movement Test used by the Navy is intended to select men
who, because of small-boat experience or aptitude, will be able to visualize
how ships on various courses will move in relation to each other. The test
consists of problems about “Ship A, one mile north of ship B, proceeding SW
at five knots, etc.” One landlubber of the worst sort was highly successful on
the test because he could work the problems by geometry, while the
mariners in the tested group were solving them by visualizing the sea situation.
At sea, of course, the high-scoring men with sea experience could state
almost without thinking where a given boat would be in fifteen minutes,
whereas the mathematician, for all his high score, was completely at a loss.
For some men the test was one of movement visualization, but he passed
it on the basis of mathematical training. Many tests permit minor or
irrelevant factors to influence the scores.

2. The Wechsler-Bellevue mental test designed for the Army required
subjects to assemble cutout puzzles which formed objects. One cutout of
a house seemed to have little relationship to ability. It was eventually
concluded that this was due to a “cultural factor”; persons coming from regions
where houses like the puzzle house were uncommon were handicapped on
the test despite normal mental ability. The house item was replaced by a
face to be assembled. Cultural factors make it difficult to obtain valid tests of
intelligence.

3. Pitch discrimination is tested by Seashore and others by means of
recorded sounds. The subject hears two successive tones and must tell
whether the second is higher or lower than the first. On difficult items, it
was found that some persons usually said “higher” while others were
habitually biased toward “lower.” Since this tendency affects the score the person
receives, the test is not a pure measure of pitch discrimination. “Response
sets” (personal habits of responding in borderline judgments) dilute
some achievement and personality tests.

4. Traditional tests in school subjects have been criticized because they 
do not show the student’s ability on all aspects of the course. If a course is 
supposed to teach historical facts and to develop ability to understand 
modern social problems, a test which asks only fact questions rewards 
memorization of facts but does not measure ability to apply those facts.
Similar criticisms have been made of many testing and grading practices 
which test only some aspects of the variable under consideration. 

5. The meaningfulness of any test score is also lowered by chance errors
of measurement. These are estimated by reliability formulas (see below).

Where an adequate definition of the function tested is not available, the
test user studies the items to determine what definition they would fit. This 
is at best an uncertain process. A skillful analyst calls on a fund of knowl- 
edge about how people react and about factors found in previous studies of 
tests. Add to these keen observation and flexible thinking, and he is likely 
to know much about the test when he has finished.

But his final conclusion is at best only an enlightened guess, unless he can 
confirm his thinking by empirical studies of the test. This point cannot be 
too strongly emphasized. Over and over, it is found that tests which “ought
to” predict some behavior do not. No test can be relied on for practical use 
until it has been validated empirically. 

Attention must be given, not only to the content of items, but also to the
form of presentation. This is demonstrated in studies of response sets. Most
students tend to answer “true” rather than “false” when in doubt on a
true-false test. This sometimes means that the student with little knowledge
guesses correctly on true items but tends to guess wrong on false items. For
one examination in psychology, three scores were obtained for each student:
his score on the true items only, on the false items only, and on the total test.
The reliability coefficients (split-half) of the three scores were .11, .72, and
.62, respectively. (The meaning of this coefficient is explained on p. 66.) The
correlations with a composite of other tests in the course were: true items,
.22; false items, .67; total, .67. In other words, the false items alone tended
to measure knowledge as well as the total test, but the true items were so
much influenced by the habit of saying “yes” when in doubt that they were
unreliable and invalid (12).
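
The three scores described above can be obtained by splitting the key according to the keyed answer. The sketch below is an illustration only, with an invented ten-item key and a hypothetical function name; it does not reproduce the data of the study cited. It scores a student who has no knowledge and marks “true” on every item.

```python
def true_false_scores(key, answers):
    """Return (score on true-keyed items, score on false-keyed items, total)."""
    true_score = sum(1 for k, a in zip(key, answers) if k and a == k)
    false_score = sum(1 for k, a in zip(key, answers) if not k and a == k)
    return true_score, false_score, true_score + false_score

# Invented ten-item key: True means the statement is keyed "true".
key = [True, False, True, True, False, False, True, False, True, False]

# A student who knows nothing and marks "true" whenever in doubt,
# which for him means on every item.
always_true = [True] * len(key)

result = true_false_scores(key, always_true)
print(result)
```

Such a student earns full credit on the true-keyed items and none on the false-keyed items, so his score on the true items reflects his habit of responding rather than his knowledge.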

A response set is a habit which causes the subject to earn a different
score from what he would earn if the same items were presented in a
different form. Six types of response sets have been identified (15):

1. Tendency to gamble. When a subject does not know the answer to an 
item, he may leave it blank or may attempt to give an answer. This fac- 
tor is eliminated only when every person attempts every item. 

2. Different definitions assigned to response categories such as “agree” or
“like.” Items like the following are common in attitude tests, where the
subject is to mark each statement “agree,” “uncertain,” or “disagree.”

A U D We cannot have lasting peace in a world where some nations are
undemocratic.

A U D Compulsory military training is the best way to keep the United
States safe from attack.

A U D The United States must not surrender to an international body its
right to go to war.

Some subjects define “uncertain” very narrowly, using it only when they
have absolutely no opinion regarding the item. Others consistently use
“uncertain” for any statement about which they are not absolutely positive.
With these different habits of responding, two subjects with the same
beliefs would receive different scores.

3. Tendency to give many responses, when the subject is to check as many 
statements as he wishes. Some reasoning tests, for example, present a list 
of possible reasons and ask the subject to mark those which support a 
certain conclusion. A subject who marks a great many reasons is almost 
certain to include several poor ones, whereas the subject who stops after 
picking two or three best reasons will seem to show very accurate reasoning.

4. Bias, in which the subject tends to give one particular answer when in 
doubt. This has been illustrated above in regard to pitch tests and true- 
false tests. Similar bias is found in agree-disagree or like-dislike judg- 
ments. 

5. Working for speed rather than accuracy. In any timed test, the subject 
sets himself a pace. If it is a fast pace, he takes chances of inaccuracy. If 
it is a careful pace, he may not complete as many items as he otherwise 
would. Sometimes a rapid pace will yield the highest score, and some- 
times the slower pace will be rewarded. 

6. Style of answer in essay tests. Everyone writing a free response has a 
choice of many styles: terse, elaborated, flowing, or outlined. Scorers 
are influenced by such factors. 

These response sets appear to be personality characteristics. By influenc- 
ing scores on tests of ability or attitude they dilute the degree to which scores 
depend on the variable to be measured. The multiple-choice item, with the 
subject forced to choose on every item, is the only form in common use for 
pencil-and-paper tests which appears to be free from response sets. This is 
one reason why it is being increasingly used in tests of knowledge, attitudes, 
interests, and personality. The following items illustrate the forced-choice 
technique: 

Select the answer which best completes the statement: 

Magellan discovered: Cuba the Philippines Japan India 

Underline the adjective in each pair which best describes you: 
charitable - industrious     gloomy - careless

Which of the following would you prefer to do, assuming you had the needed 
abilities? 

Be a skywriter 

Write newspaper editorials 

Be a weather forecaster 

Often, reconsideration is advisable when a subject’s unusual response set
may have lowered his score. This comment is made about one of the
Differential Aptitude Tests (6, p. E-5):
“The Clerical Speed and Accuracy Test is designed to measure the student’s
speed and accuracy with simple number and letter combinations. It is the one test 
in the entire series which places a heavy premium on speed. As the test is scored, 
in fact, the student is not additionally penalized for any errors he may make, i.e., 
his score is the number of items correctly marked — wrongs are not subtracted. 
This scoring decision was based on research which indicated that in a task of the
simplicity of this one, there are few, if any, errors made. ... In interpreting 
Clerical scores of students whose other scores are high, it is vital that low scores 
be eyed suspiciously. With good all-round students, a low score on this test is 
just as likely to indicate the student’s stress on correctness as it is to indicate 
real inability to work rapidly. The school ordinarily emphasizes accuracy rather 
than speed, and properly so; it is not surprising then that some good students
will continue working in that way despite instructions to work rapidly. If Clerical 
Speed and Accuracy is considered by a counselor to be important for a specific 
good student whose score is low, it would be wise to retest him after re-empha- 
sizing the importance of speed. (Forty-two pupils who scored high on the Verbal 
Reasoning and Numerical Ability tests and low on the Clerical Speed and Ac- 
curacy test, were retested after being urged to work rapidly. Their scores all 
increased appreciably.)” 

5. A test of intelligence uses items like the following:

sweet - sour          SAME - OPPOSITE

obscure - lucid       SAME - OPPOSITE

occult - mystical     SAME - OPPOSITE

The student is directed to underline whichever word at the right is correct.
What response sets is such a test subject to, if it is given with a time limit?
How could a test calling for similar abilities be designed which would be less
affected by response sets?

6. The Nelson-Denny Reading Test for high school and college contains a vocabu- 
lary section in which the subject chooses which of five words means the same 
as a key word. There are one hundred items. Ten minutes is allowed. The score 
is obtained by counting the number of correct judgments. With this knowledge,
how can you explain the performance of a feeble-minded boy who reached a 
score of 25, better than 60 percent of other ninth-graders (15)? 

7. Defend the statement: “In analysis of the psychological meaning of a test, the
directions must be considered as part of the test.”

How one proceeds in test analysis may be clarified by some examples.
Figure 10 shows items from tests of intelligence and special ability. The top
illustration, from one of Thurstone’s tests based on factor analysis,
represents a rather “pure” test. It seems to call for just one type of judgment,
ability to visualize how a figure will look when rotated. In addition to
“aptitude,” however, observations show that work methods make some
difference in the ease with which people obtain correct answers. Some think
each item through carefully with a trial-and-error approach, mentally trying
to turn the first figure into each position. Others first establish a
generalization and try to apply it to each figure. The generalization may be
incorrect, especially in persons having limited mental development or lack of
experience. . . . loop must always be to the left will make several errors.
The person who has through some training learned the concept “clockwise” can
quickly decide, for each figure, whether the hook is “clockwise” or
“counterclockwise” and solve the problem without error. For such people, then,
high scores indicate education rather than endowment alone. In addition to
ability factors, the test is also influenced by the subject’s patience,
carefulness, and tendency to guess. Some pupils are more painstaking than
others in looking for all the correct answers.

“In each row put a mark under every figure which is like the first figure in the row.”

(From L. L. and T. G. Thurstone, Chicago Tests of Primary Mental Abilities,
Spatial, Science Research Associates, p. 3. Copyright 1941.)


“Rank the diagonals A, B, C, D, E according to their length; that is, write 1 in the square
corresponding to the longest diagonal, 2 in that corresponding to the next longest, and so
on.”

(From D. L. Zyve, Stanford Scientific Aptitude Test, Stanford University Press.
Reduced about one-third.)


“Calculate the weight of a steel plate one foot square, inch thick, with a rectangular
hole which measures 3 inches by 8 inches. Steel weighs 0.3 pound per cubic inch.”

(A) 1.8 (B) 36 (C) 14.4 (D) 5.4 (E) 18

(From B. V. Moore, C. J. Lapp, and C. H. Griffin, Engineering and Physical
Science Aptitude Test, The Psychological Corporation.)


“Read each statement and mark the answer space which has the same number as the answer
which you think is BEST.”

Farmers rotate crops because

1. Variety is the spice of life.

2. It confuses the plant pests.

3. It helps maintain soil fertility.

4. It gives the farmer a balanced diet.

(From L. M. Terman and Q. McNemar, Terman-McNemar Test of Mental Ability,
Form C, World Book Company.)


“In each row, put an X in the box in the picture that is MOST DIFFERENT.”

(From R. N. McMurry and J. E. King, SRA Non-Verbal Form, Form AH, Science
Research Associates, items 43 and 44. Copyright 1947.)


Fig. 10. Specimen items from tests of ability.

The selection from the Zyve test is thought of by Zyve as measuring
“caution, i.e., the tendency of the student to pause to investigate before
adopting a method of behavior” (52). Subjects who guess at the answer
will usually be fooled by optical illusions in the figure, whereas those who
measure before answering should be more successful. Zyve feels that this
attitude of depending on measurement rather than judgment is important
for students of science. The test might measure “attitude in taking the test
of obtaining objective data for making a judgment”; it is not certain that
the same attitude would be present in other situations. A subject who is 
in a hurry, perhaps because he has many tests to take during a day at a 
guidance center, may be less accurate than he would be at other times. Still 
another factor affecting test score is the precision with which the subject 
measures. Some of the diagonals differ by only tiny amounts, so that even a 
person with a data-seeking attitude can obtain wrong answers unless he is 
also precise in measuring. 

One of the most valuable ways to understand any test is to administer it 
individually, requiring the subject to work the problems aloud. A score 
for these experimental subjects can not be compared with the standard 
norms. But the tester learns just what mental processes are used in solving 
the exercises, and what mental and personality factors cause errors. A test 
of reasoning, for example, is clearly valid if all subjects who reason ac- 
curately get the right answers, but it is a poorer test if some persons fail only 
because they lack information required to solve the problems. 

8. Analyze each of the other illustrations given in Figure 10 to decide what factors
might affect the score of a college freshman taking the tests.

The name of a test is often misleading. Different tests of intelligence con- 
tain different sorts of items; therefore they must measure, in part, different 
things. Names are convenient, but the test user must be aware that the only 
way to know what he is measuring is to study the test items and the test 
structure. “Introversion” in one test is not the same as “introversion” in
another. “Knowledge of arithmetic” in one test may involve speed in simple
operations, but in another may require power of arithmetical reasoning.

A further problem in knowing what a test measures is that in different 
groups different factors account for individual differences on the test. A 
test calling for solution of scientific problems might measure intelligence in 
a group where the problems were unfamiliar. If some students had learned 
about the problems through science courses, differences in amount of science
studied would be largely responsible for differences in score. A Japanese
university includes in its entrance examination an American test of “ability 
to follow directions,” with minor changes but still printed in English. Among 
American adolescents, the test shows differences in general mental ability. 
Among the selected group of Japanese, it becomes a measure of proficiency 
in English — which is important in Japanese higher education where students 
must read much material in English. The extent to which a factor influences 
differences in score is proportional to the range in that factor among the 
group tested. The Japanese vary more than the Americans in ability to read 
simple English. That ability has correspondingly high weight in the test 
when it is given to Japanese. 

9. Steadiness is measured by having the subject put a stylus in a series of
successively smaller holes without touching the sides. An electrical contact
between the stylus and plate indicates an error. On a device similar to that
in Figure 11, schizophrenics earned significantly poorer scores than normals
(32). Why might this not indicate that these schizophrenics are less able to
coordinate their muscles in hand movements?

Fig. 11. Steadiness test. (Courtesy Marietta Apparatus Co.)

10. A school system selects high-school teachers by a series of criteria, one of
which is “knowledge of current affairs.” It uses as a measure the Time
Current Affairs Test, giving it to applicants before it is published in Time
magazine, and without letting them know what test is to be used. Discuss the
validity of the test in this use.

11. A set of red, blue, and yellow tags is provided to be sorted into red, yellow,
and blue boxes. Number of tags sorted in two minutes is measured. Among
the abilities required for a good score are understanding of directions, color
discrimination, sustained attention, and motor speed. Which factors would
influence scores most within each of the following groups: two-year-olds,
six-year-olds, college students?

Empirical Validity 

If a test correlates with some other variable of known validity, we can say 
that in part it measures the same thing. Empirical validity is studied by
comparing test results with a criterion known to measure some characteristic of
importance. For example, if we wish to know whether a test will identify
successful salesmen, we might give it to new employees of a wholesale hardware
concern and after six months correlate the test scores with the total
amount of hardware each man has sold. The criterion, “amount sold in six
months,” is an index of success in selling. The size of the correlation
indicates how well the test predicts this criterion. If the test does not predict, it
is not a measure of “probable success in selling hardware” in this situation,
and is invalid for selecting salesmen. It will be noted that empirical studies
rarely clarify what psychological factors are represented in a test, but rather
define the test’s uses and limitations in practical terms.
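
The hardware-salesman study can be sketched in a few lines of Python. The scores and sales figures below are invented purely for illustration, and the function name `pearson_r` is our own; the sketch only shows how a validity coefficient is obtained as the product-moment correlation between test score and criterion.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 cases")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: test score at hiring and dollars sold in six months.
test_scores = [12, 15, 9, 20, 17, 11, 14, 18]
amount_sold = [3100, 4000, 2500, 5200, 3900, 3300, 3000, 4800]

validity = pearson_r(test_scores, amount_sold)
print(f"validity coefficient = {validity:.2f}")
```

A coefficient near zero would mean the test does not predict this particular criterion; it would say nothing about other criteria or other groups of salesmen.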

The principal difficulty in empirical studies is the choice of a suitable
criterion. If the criterion does not really represent “selling success,” the test has
not been given a fair trial. Let us look at the weaknesses of the criterion sug- 
gested. In the first place, it represents only the wholesale hardware business, 
so that at best we can judge the test for only this one use; additional empiri- 
cal studies would be required if the test were being considered for hiring fire 
insurance, automobile, or machine-tool salesmen. Although amount sold ap- 
pears to be a fair basis for judging success, it is probable that some men were 
assigned more desirable territory than others, so that sales would not reflect 
ability alone. Even if we attempt to control this by comparing each man’s 
sales with normal sales in his territory, we have not considered the possible 
effect on business of seasonal factors, such as poor crops in one region. Still 
another problem is that sales alone may not be what we desire from a
salesman. A high-pressure salesman may build up high total sales on a first trip
but by overselling create problems which will eventually harm the firm’s
business. It is important to realize that even a criterion which looks reason- 
able may have faults. 

A common type of criterion is the rating or grade. Aptitude tests are
validated against marks earned in school. Industrial tests are validated against
ratings by supervisors. These ratings are weak criteria, since the judge may
not know the facts about the person, and different judges disagree. Methods
of improving ratings are discussed in Chapter 18, but even the best ratings
are only partially valid. When a test fails to predict a rating, one cannot say
whether this is the fault of the test or of the rating. 

Designers of new tests frequently validate their instruments by correlating 
them with established tests. Group tests of intelligence, for example, have 
frequently been correlated with the Stanford-Binet intelligence test, which 
itself has been thoroughly studied both empirically and logically. A test
which shows a high correlation with the Binet test measures “whatever the 
Binet test measures” and may be relied upon for the same purposes. This 
procedure is often relied upon when no non-test criterion is readily avail- 
able. Its weakness is that we must be very sure of the meaning of the test 
used as a criterion. There is little merit in showing that three questionnaires 
of “neurotic tendency” correlate highly, if scores on all the tests are deter- 
mined mostly by the subject’s ability to see through the purpose of the test 
and give “desirable” answers. 

Table 3 shows the frequency with which various criteria are used in stud- 
ies reported in one journal. 

12. What are the weaknesses of the procedure described in the following
quotation from a handbook on testing?

“Generally speaking, the validity of the test is best determined by using
common sense in discovering that the test measures component abilities
which exist both in the test situation and on the job. This common-sense
approach to the problem of validity can be strengthened greatly by basing
the estimate of the component of the job on a systematic observation of
job analysis” (quoted in 37).

Table 3. Validating Criteria Used in Studies Reported in
Journal of Applied Psychology, 1946-1947

Criterion                                          Frequency
Ratings by instructor or superior                     14
Other test of same function                           13
School or training grades                             11
Clinical diagnosis                                    S
Laboratory or clinical measure of same function       2
Job turnover                                          2
Objective measure of job performance                  1
School examination                                    1
Accident rate on job                                  1
Interview with subject                                1

13. A test of preschool children is validated in three ways. (1) Intelligence is
defined as ability to learn responses with which one has had no previous
experience. The test items are examined and found to fit this definition. (2)
Scores on the test, given at age three, are found to be correlated with reading
skill and vocabulary knowledge at the end of the first grade. (3) Scores on
the test, given at age three, are found to be correlated with scores on the
Stanford-Binet test given at age 16.

a. What possible uses of the test are warranted, on the basis of each of these 
studies? 

b. Would it be possible for a test to show high validity by method (2), and 
not to be found valid by the other two procedures? 

14. A “test of mechanical aptitude” is found to have a correlation of .65 with
success of applicants in learning to operate and service a complex machine.
Logical analysis shows that the test is only partly a measure of mechanical
aptitude, but is also much influenced by various response sets, ability to
understand the rather difficult vocabulary used in stating questions, and
knowledge of arithmetic operations. Should it be used?

15. “Spelling ability has been found in many studies to be a good predictor of 
clerical ability. We have published a new spelling test. We may assume that
it will predict clerical ability.” What is the possible error in such a conclusion?

16. Criticize each of the following criteria. 

a. Ratings of practice teachers by their supervisors, as an index of teaching 
ability. 

b. Number of accidents a driver has per year, as an index of driver safety. 

c. Number of accidents a driver has per thousand miles, as an index of 
driver safety. 

17. A study-habits inventory asks such questions as “Do you daydream when you 
should be studying?” 

a. What criterion would you use to determine empirically whether the inventory
really measures study habits?

b. What criterion would you use to determine whether the inventory predicts
success in college?

c. Which study would be best to show that the test is valid? 

18. Criticize the procedure indicated in the following report of a study of success 
of teachers college students. 

“The correlation between all thirty of the [predictor] variables and the
[school] superintendents’ ratings was only .17, but that between the variables
and marks earned during four years of college was .79. Since college
marks were predictable on the basis of the thirty variables and ... the
superintendents’ ratings were not, the marks were substituted for the ratings
as a criterion of success” (cited in 19).

Validity coefficients are rarely as high as we would like. Table 4 lists
typical validity coefficients of various sizes from studies reported in the past.
Rarely does a validity coefficient rise above .70, although this gives far from
perfect predictive accuracy. We would be better satisfied if our coefficients
were higher, but any positive correlation indicates that predictions from the
test will be more accurate than decisions made without data. Whether a
validity coefficient is high enough to warrant use of the test depends on such
practical considerations as the validity of methods already in use (such as
interviewing), the urgency of improved selection, and the time and money
available for testing.

Table 4. Validity Coefficients of Various Sizes

Coefficient Test                              Criterion                      Sample
.04         Otis Self-Administering           Rating of success              40 factory supervisors (39)
            (intelligence)
.18         How Supervise? (judgment)         Rating of success              40 factory supervisors (39)
.24         Cardall Practical Judgment        Rating of judgment             91 student retailing clerks (42)
.31         Seashore Musical Talent           Grades in applied music        59 women students in college
                                                                             of music (30)
.42         Otis Beta (intelligence)          Rating of success              297 supervisors, several plants (41)
.45         Bennett Mechanical Comprehension  Rating of success              297 supervisors, several plants (41)
.55         Ferson-Stoddard (law aptitude)    First-semester grades in law   395 law students (20)
                                              school
.57         ACE Psychological (scholastic     Freshman marks                 97 freshmen in junior college (34)
            aptitude)
.59         Iowa Silent Reading, “rate of     Specially designed test of     High-school pupils; number
            comprehension” score              rate of comprehension          uncertain (7)
.66         Mental and psychomotor tests      Graduation or elimination      1017 Air Force trainees (18)
            (combined)                        from pilot training
.79         Mental and achievement tests      Freshman marks                 97 freshmen in junior college (34)
            combined with high-school
            grades
.82         Orthorater, visual acuity both    Clinical test of acuity        95 adults in industrial eye
            eyes                                                             clinic (17)
.92         Revised Beta (non-verbal          Wechsler test of mental        168 prisoners (35)
            intelligence)                     ability

In interpreting any correlation coefficient, the range of the group studied
must be considered. The correlation is smaller in a selected group than in a
group containing a wider range of ability. Perhaps the test How Supervise? 
which correlated only .18 with success in supervision (Table 4), would give 
a higher validity in selecting good supervisors from a random group of fac- 
tory workers. The group tested, composed only of people already in super- 
visory work, had probably already been screened to eliminate men of very 
poor judgment. The effect of range is discussed at greater length in Chap- 
ter 11 (pp. 260 ff.). 

A few current reports suggest a new concept of validity, called factorial 
validity. A test has high factorial validity if it is a pure measure of just one 
type of ability. Most mental tests used in school are far from pure; instead, 
they require a combination of many different abilities, such as verbal ability, 
reasoning, reading, mental speed, and so on. It is possible that pure tests,
each measuring one ability, will eventually permit the tester to obtain better
predictions (26). Pure tests are constructed on the basis of factor analysis,
which will be discussed in Chapter 9. The most extensive discussion of
factorial validity appears in a report on Air Force selection edited by
Guilford (27).

19. In Table 4, what characteristics seem to be associated with tests having high
coefficients?

20. What characteristics are associated with criteria yielding high coefficients?

21. How can you account for the low size of the first six coefficients, other than 
by assuming that the tests fail to measure what they are intended to measure? 

Reliability 

No test can have validity unless it measures accurately. The accuracy of 
measurement is expressed in the reliability coefficient. Any test score is a 
somewhat inaccurate measure, because many errors creep in when we take 
a small sample of a person’s behavior. If we wish to measure ability to add, 
we present the subject with many items drawn from all the possible addition 
problems. We obtain a sample of his work on those items, at a particular 
time, assuming that this sample shows what his work will be like at other 
times. The measure obtained by a particular set of items will give an errone- 
ous score if the combinations included happen to be those especially easy for 
the person. The sample of behavior will not be representative if we test him 
on an “off” day instead of taking our sample at a more favorable time. Such 
errors of measurement, together with all the small changes in score resulting 
from “lucky guesses” and the like, may result in a misleading impression of 
the subject. The reliability coefficient shows the extent to which errors of 
measurement influence scores on the test. 

The complexity of factors which may influence scores on a test is indicated 
in Table 5, prepared by R. L. Thorndike. Reliability coefficients estimate the 
extent to which some or all of the factors in sections II, III, IV, and V of the 
table cause random changes in score.

Table 5. Possible Sources of Differences in Performance on a Test
(48, pp. 102-103)

I. Lasting and general characteristics of the individual.
   A. General skills and techniques of taking tests.
   B. General ability to comprehend instructions.
   C. Level of ability on one or more general traits which operate in a number of tests.

II. Lasting but specific characteristics of the individual.
   A. Specific to the test as a whole (and to parallel forms of it).
      1. Individual level of ability on traits required in this test but not in others.
      2. Knowledge and skills specific to particular form of test items.
   B. Specific to particular test items.
      1. The “chance” element determining whether the individual does or does not
         know a particular fact.

III. Temporary but general characteristics of the individual (factors affecting
     performance on many or all tests at a particular time).
   A. Health.
   B. Fatigue.
   C. Motivation.
   D. Emotional strain.
   E. General test-wiseness (partly lasting).
   F. External conditions of heat, light, ventilation, etc.

IV. Temporary and specific characteristics of the individual.
   A. Specific to the test as a whole.
      1. Comprehension of the specific test task (insofar as this is distinct from I-B).
      2. Specific tricks or techniques of dealing with the particular test materials
         (insofar as distinct from II-A-2).
      3. Level of practice on the specific skills involved, especially in psychomotor tests.
      4. Momentary “set” for a particular test.
   B. Specific to particular test items.
      1. Fluctuations and idiosyncrasies of human memory.
      2. Unpredictable fluctuations in attention or accuracy.

V. Variance not otherwise accounted for (chance).
   A. Luck in the selection of answers by “guessing.”


22. Locate each of the following sources of change in score in Thorndike’s
classification.

a. A student breaks his pencil and loses test time while obtaining another. 

b. An industrial worker who has been in this country a short time mis- 
understands an important phrase in the instructions for a performance test. 

c. A “hillbilly” is unable to answer correctly a question from an intelligence
test about the purchase of a railroad ticket.

d. A suspicious patient refuses to cooperate with a test and gives perfunctory 
answers. 

e. A student guesses at every item of which he is uncertain. 

General Principles 

Before considering methods of estimating the reliability of a test, a few 
general principles may be discussed. The principles are as follows: 

1. The reliability coefficient depends on the length of the test.

2. The reliability coefficient depends on the spread of scores in the group
studied.

3. A test may give reliable measures at one level of ability, and unreliable
measures at another level.

The importance of lengthening tests is that, with every question added,
the sample of performance becomes more adequate. One, and only one, ad- 
dition problem is a very poor sample of a person's ability, since we are quite 
likely to hit on a number combination particularly hard or easy for him. If 
we ask more and more questions of the same general sort, we are far more 
likely to obtain a good estimate of his general ability on addition problems. 
Longer tests also are less influenced by guessing. If a test has only five 
multiple-choice items, a few people might get all the items correct by guess- 
ing, but in a fifty-item test, variations due to guessing tend to cancel out. 
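
The effect of test length on guessing can be illustrated with a short calculation. The sketch below assumes four answer choices per item and purely random guessing on every item; neither assumption comes from the text, and the function names are our own. It shows that a perfect score by guessing is barely possible on five items and effectively impossible on fifty, and that the chance fluctuation in the proportion of items guessed correctly shrinks as items are added.

```python
from math import sqrt

def p_all_correct_by_guessing(n_items, n_choices):
    """Probability of answering every item correctly by blind guessing."""
    return (1 / n_choices) ** n_items

def sd_of_guessed_proportion(n_items, n_choices):
    """Standard deviation of the proportion correct under blind guessing
    (binomial model: each item is an independent guess)."""
    p = 1 / n_choices
    return sqrt(p * (1 - p) / n_items)

# Assumed for illustration: four choices per item.
for n_items in (5, 50):
    print(
        n_items,
        "items: P(all correct) =", p_all_correct_by_guessing(n_items, 4),
        " s.d. of proportion correct =",
        round(sd_of_guessed_proportion(n_items, 4), 3),
    )
```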


Suppose that a test has a reliability of .40. The Spearman-Brown formula
estimates the reliability of the score from a similar test n times as long:

$$ r_n = \frac{nr}{1 + (n - 1)r} $$

where r is the original reliability and r_n is the reliability of the test
n times as long.

To determine the reliability of a test twice as long as the original test,
substitute in the formula n = 2, r = .40:

$$ r_n = \frac{2(.40)}{1 + (1)(.40)} = \frac{.80}{1.40} = .57 $$

Suppose the original test is reduced to only half its original length. The
reliability of the short test is computed using n = 1/2:

$$ r_n = \frac{\frac{1}{2}(.40)}{1 + (\frac{1}{2} - 1)(.40)} = \frac{.20}{1 - \frac{1}{2}(.40)} = \frac{.20}{.80} = .25 $$

Computing Guide 5. Use of the Spearman-Brown formula.


Short tests can be made more accurate by lengthening them. The Spear- 
man-Brown formula, presented and illustrated in Computing Guide 5, esti- 
mates what reliability would be obtained if a test were lengthened or short- 
ened. The formula assumes that when we lengthen the test we do not change 
its nature. Extreme increases in test length, however, introduce effects of 
boredom and actually reduce reliability. Unless one is careful, added items 
may not cover the same topic or ability as the original test. 
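
The Spearman-Brown computation of Computing Guide 5 can be sketched as a small function. The function names are our own, and the second function is simply an algebraic rearrangement of the same formula (solving for n), added here as an illustration rather than taken from the text. Both rest on the formula's assumption that the added items are of the same nature as the original ones.

```python
def spearman_brown(r, n):
    """Estimated reliability of a test n times as long as one with reliability r."""
    if n <= 0:
        raise ValueError("length factor n must be positive")
    return n * r / (1 + (n - 1) * r)

def length_factor_needed(r, r_target):
    """Length factor n at which the formula yields r_target.
    Obtained by solving the Spearman-Brown formula for n."""
    if not (0 < r < 1 and 0 < r_target < 1):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return r_target * (1 - r) / (r * (1 - r_target))

# The two cases worked in Computing Guide 5 (original reliability .40).
print(round(spearman_brown(0.40, 2), 2))    # test doubled in length
print(round(spearman_brown(0.40, 0.5), 2))  # test cut to half length

# How much longer must the .40 test be to reach a reliability of .80?
print(round(length_factor_needed(0.40, 0.80), 1))
```

The last line shows the practical cost of lengthening: under the formula, a test with reliability .40 would have to be made six times as long to reach .80, a length at which boredom may well defeat the purpose.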

The reliability of every score one intends to interpret needs to be studied. 
Because short tests tend to be unreliable, part-scores based on a few test 
items are of limited value. Unfortunately, many testers, knowing that a test 
as a whole is reliable, place faith in part-scores which are unreliable. Fre- 
quently authors fail to publish the reliability of subscores even though they 
advise the user to consider the subscores separately.

23. A spelling test of 30 words has a reliability of .80. What reliability might be
expected if 90 words were used? 

24. The California Test of Personality containing 144 items is reported to have 
a reliability coefficient of .933. What would be the approximate reliability of 
the subtest Sense of Personal Worth, which contains 12 items? (Use the
formula, with n = 1/12. This procedure underestimates the reliability because
assumptions of the formula are not satisfied. The exact reliability of the
subtest score has not been reported.)

Fig. 12. Changes in the reliability coefficient as a test is applied to an
increasingly uniform group.

Errors of measurement are most important when there are small differences
in true ability among the people being compared. In hiring one person
out of a group of applicants, the final decision often hinges on a difference
of a few points between the best and next-best man. Slight errors of measure- 
ment in such a case might result in hiring the poorer man. If one is screening 
general applicants for factory work merely to rule out the incompetent pros- 
pects, a less refined test will be satisfactory. An error of a few points in score 
cannot conceal the gross deficiencies in ability which distinguish the “hope- 
less” from the average run of workers. A test which has a satisfactorily high 
reliability coefficient for use with groups containing wide differences in abil- 
ity may be unsatisfactory for comparing people in a highly selected group. 
Figure 12 shows how the reliability coefficient of a test drops when the test 
is applied to groups having a smaller range of differences. The drop in re- 
liability indicates that, as the range decreases, a larger and larger percentage 
of the differences among individuals is due to errors of measurement.

25. A nationally known test of intelligence has a reliability of .82 for a
representative population of college freshmen from all colleges. For this group the s.d. is
9.8. In a certain college which draws largely from superior high schools and
has severe admission standards, the s.d. on this test is 7.6. If the test is equally
reliable over the entire range of scores, what will its reliability coefficient be in
this college? (Estimate from Figure 12.) 
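
The drop in reliability as a group becomes more uniform can also be approximated by calculation rather than read from a chart. The sketch below uses a standard relation that the text itself does not state: if errors of measurement are the same size in both groups, the error variance stays fixed while the total variance shrinks with the group's standard deviation. The function name and the numbers are hypothetical.

```python
def reliability_in_new_group(r_old, sd_old, sd_new):
    """Reliability expected in a group with standard deviation sd_new,
    given reliability r_old in a group with standard deviation sd_old.
    Assumes the error of measurement is equally large in both groups."""
    if sd_old <= 0 or sd_new <= 0:
        raise ValueError("standard deviations must be positive")
    # Error variance = total variance times the unreliable fraction.
    error_variance = sd_old ** 2 * (1 - r_old)
    return 1 - error_variance / sd_new ** 2

# Hypothetical test: reliability .90 where s.d. = 10, applied to a
# selected group whose s.d. is only 5.
print(round(reliability_in_new_group(0.90, 10.0, 5.0), 2))
```

In this invented case, halving the spread of scores leaves the coefficient at about .60, since errors of measurement now make up a much larger share of the differences among individuals.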

Because test reliability is usually reported as a single coefficient, it may be
erroneously assumed that a test with a high coefficient is accurate for all
types of people. But many tests are reliable only at certain levels of difficulty.
The Gates Reading Survey Test, for example, is a reliable test, and gives
good estimates of reading skill for pupils in most grades. But it attempts to
measure pupils in any grade from third to tenth. When third-graders take
the test, so much of it is difficult that they do a great deal of guessing. As a
result, scores in the third grade give an inaccurate picture of individual dif- 
ferences. Tests, no matter how reliable, give inaccurate measures for pupils 
whose scores are near the chance level, and some tests give inaccurate meas- 
ures of the extremely high ranges of talent. 

Figure 13 shows a scatter diagram in which are plotted the scores of lit- 
erate, non-defective Navy recruits who took a pitch-discrimination test twice. 
If the test is accurate, the two scores for each man should be nearly equal. 
The test consisted of 100 pairs of tones; in each pair, the man reported 
whether the second tone was higher or lower. A score of 50 would be ob- 
tained by pure chance. From the scatter diagram, high scores appear to be 
fairly reliable. Men scoring 85 on the first test fell between 72 and 95 on
retest. But men scoring near the chance level (e.g., 55) scattered over a wide
range on the retest (40 to 87). The upcurve of the regression line at the left
is especially interesting. People with very low scores on the first test often
did well on the retest. A score of 25 is so far below 50 that it probably did
not occur by chance. Probably men with such low scores misunderstood
directions on the first test and judged the first of the two tones as higher or
lower. Following directions correctly on the retest would account for a shift
from 70 items wrong to 70 items right (23).
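
The statement that a score of 25 probably did not occur by chance can be checked with a short calculation. The sketch below treats pure guessing on the 100 two-choice items as a binomial process with probability .5 per item; this model and the function name are our own assumptions, not part of the original report.

```python
from math import comb, sqrt

def prob_score_at_most(k, n_items=100, p=0.5):
    """Probability of k or fewer items right when every item is a blind guess."""
    return sum(
        comb(n_items, i) * p ** i * (1 - p) ** (n_items - i)
        for i in range(k + 1)
    )

chance_mean = 100 * 0.5            # expected score from guessing alone
chance_sd = sqrt(100 * 0.5 * 0.5)  # s.d. of scores from guessing alone

print("chance mean:", chance_mean, " chance s.d.:", chance_sd)
print("P(score <= 25 by guessing):", prob_score_at_most(25))
print("P(score <= 55 by guessing):", round(prob_score_at_most(55), 2))
```

Under this model a score of 25 lies five standard deviations below the chance mean, so guessing alone practically never produces it, whereas a score of 55 is well within the range that guessing alone produces. This is consistent with the interpretation that very low scorers were responding systematically, but in the wrong direction.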

Fig. 13. Test-retest scores of naval recruits on a pitch-discrimination test (23).
(Scatter diagram; horizontal axis: Score on First Test, in intervals from 20-23 to 96 and up.)

A general rule of test selection is that the test must be appropriate in
difficulty for the decision to be made. Figure 14 shows several distributions of
test scores for the same group. Test A is very easy for the group; it gives a
skewed distribution. It is probably unreliable for the group as a whole, since
only a few points’ variation would cause a subject to drop from the top of
the group to the average. Furthermore, it does not distinguish between the
persons tying at 100 even though they are probably not equally superior.
The test may be quite satisfactory for measuring differences at the lower end
of the group. Test B is difficult. The distribution is again skewed, but this
time high scores are spread out and differences at the low end of the scale
are too small to distinguish individuals reliably. A normal distribution, such
as that for Test C, has the advantage of spreading out cases equally at both
ends of the scale. Under this condition, scores at the two ends are more likely
to be equally reliable. For this reason, tests yielding roughly normal
distributions are preferred where it is necessary to distinguish equally well at all
positions on the scale. A normal distribution is not required for a test. Test D
illustrates a type of distribution for a special purpose where only very coarse
(but correct) grading was desired. The Army wished to screen all men
having a mental age below 11 out of its induction line without giving a full
mental test. A set of single-question tests was prepared, each question being
one that the average person of mental age 11 could answer. Each man was
asked one of the questions; if he failed, he was given another. Those who
failed both were screened for further study (31). Such a test does not
measure individual differences within either the passing or failing group, but it
gives an accurate measure at the one level where measurement is needed.
Such tests are like the “go-no go” gauge used in industry to determine
whether a part has been machined sufficiently.

Fig. 14. Scores on several tests given to the same group.

26. Considering distributions A, B, and C in Figure 14, which would be most de- 
sirable to have in each of the following cases? 

a. A psychologist wishes to study the relationship between voting habits and 
an attitude test indicating liberalism. 

b. A college wishes to select those of its freshmen needing help in reading. 

c. Pupils to compete for national scholarships are to be selected. 

d. An employer wishes to select the best statistician from a group of appli- 
cants. 

27. The California Test of Personality, Elementary, contains several subtests, one 
of which is Feeling of Belonging. A low score on this test is said to indicate 
maladjustment. According to the test manual, the percentile rank correspond- 
ing to each possible score is as follows: 

Score   1   2   3   4   5   6   7   8   9  10  11  12

P.R.    1   1   1   1   1   5  10  15  25  40  65  90

a. What is the shape of the raw-score distribution? What does this distribu- 
tion imply regarding the usefulness of the test? 

b. How would a boy’s standing in the group change if his score changed two 
points? 

28. In World Series baseball, some pinch hitters reach batting averages as high as 
.750, whereas the best regular players rarely exceed .400 for seven games. 
How can this be explained? What principle regarding reliability does it
illustrate?

Methods of Determining Coefficients 

There are three types of reliability coefficient: the coefficient of equiva- 
lence, the coefficient of stability, and the coefficient of stability and equiva- 
lence. The coefficient of equivalence indicates how much scores fluctuate 
from form to form of the same test. The coefficient of stability measures the 
fluctuation on the same questions from one time to another. Both form-to-form
and time-to-time fluctuations are considered in the coefficient of stability
and equivalence, which is lower than the other two. When a “reliability
coefficient” is reported, it is important to consider which type it is, since
the three coefficients have different meanings (16; cf. Fig. 15).

The coefficient of equivalence indicates how precisely the test measures 
the person’s performance at the particular moment. It estimates how much 



Fig. 15. Comparison of validity coefficient and three reliability coefficients. 


his score would change if a different sample of questions testing the same 
ability had been used. This coefficient is usually estimated by the split-half 
method, as shown in Computing Guide 6, or by an alternate procedure using 
the Spearman-Brown formula. Tlie split-half method gives misleading results 
unless the two half-tests are just as equivalent as parallel forms of the same 
test would be. The halves must have nearly equal standard deviations. The 
halves must be similar in content; it would not be fair to split an intelligence 
test containing some arithmetic questions so that most of those items fell 
into one half. It is common practice to compare the odd items (1, 3, 5, etc.)
with the even items (2, 4, 6, etc.), but this may give halves which are not
equivalent (14). A variation known as the parallel-split is recommended to
make the half-tests more alike. A group of papers is analyzed to determine 
how many persons pass each item. Items are divided into two groups such 
that items in the two halves are matched in difficulty and in content. A sec- 
ond group of papers is then scored on the two half-tests and the formula of 
Computing Guide 6 applied (13).
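The split-half computation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the formula is the one attributed to Guttman in Computing Guide 6, and the standard deviations (3.0, 3.2, and 5.5) are those of the guide's worked example. The helper that starts from raw scores uses the population variance; because the formula depends only on a ratio of variances, the choice of divisor does not affect the result as long as it is applied consistently.

```python
from statistics import pvariance


def guttman_split_half(var_a, var_b, var_total):
    """Split-half reliability: r = 2 * (1 - (s_a^2 + s_b^2) / s_t^2)."""
    return 2.0 * (1.0 - (var_a + var_b) / var_total)


def split_half_from_scores(first_half, second_half):
    """Apply the formula to paired half-test scores, one pair per person."""
    totals = [a + b for a, b in zip(first_half, second_half)]
    return guttman_split_half(
        pvariance(first_half), pvariance(second_half), pvariance(totals)
    )


# Standard deviations from the worked example: 3.0, 3.2, and 5.5.
r = guttman_split_half(3.0 ** 2, 3.2 ** 2, 5.5 ** 2)
print(round(r, 2))  # 0.73, agreeing with the guide
```

When the two halves rank every person identically and have equal spread, the function returns 1.0; the less the halves agree, the smaller the total variance is relative to the half variances, and the lower the coefficient.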

The Kuder-Richardson formulas also estimate the coefficient of equivalence.
The most widely used formula is presented in Computing Guide 7.
This gives a good estimate when all the items in a test represent a single 
general factor (see Chapter 9). If the items in a test cover many different 






1. In order to determine the reliability of a test by the split-half method,
   the test is divided into two halves, as nearly alike as possible in
   difficulty and content.

2. Each paper is scored on the two parts separately, so that one has an array
   of scores like that below.

                                 Score
       Name          First Half   Second Half   Total
                        (a)           (b)        (t)
       Adams, J.         22            19         41
       Allen, R.         30            22         52
       Arthur, S.        25            27         52
       Ashton, F.        16            18         34
       etc.

3. The standard deviation of each column is computed, by the procedure of
   Computing Guide 2.

       First half:   s_a = 3.0
       Second half:  s_b = 3.2
       Total:        s_t = 5.5

4. The numbers are substituted in Guttman’s formula (28):

   $$ r = 2\left(1 - \frac{s_a^2 + s_b^2}{s_t^2}\right) $$

   s_a and s_b are the sigmas of the half-tests, and s_t is that for the
   total. r is the coefficient of equivalence (reliability).

   $$ r = 2\left(1 - \frac{9.00 + 10.24}{30.25}\right) $$

       r = 2 (1 - .636)
       r = 2 (.364) = .73

The resulting coefficient is a serious underestimate if s_a and s_b are not
nearly equal. In that case, a new split must be made.

Computing Guide 6. Determining the split-half reliability coefficient.



1. Before applying the Kuder-Richardson method to determine the reliability
   of a test, decide whether the test measures just one ability or calls for
   different abilities in different items. Unless all the items measure the
   same general ability, the method gives a misleading result.

2. Make a frequency distribution of scores and compute the standard deviation
   (s) by the method of Computing Guide 2. For a particular 60-item test, the
   result may be:

       s = 6.2

3. Make a tabulation of the number of people passing each item. This gives an
   array like that below. (The tabulation is usually made for a large number
   of papers, 50 or more, but only 20 papers are used in this illustration.)

       Item    Number of Persons Passing      p      q      pq
       No.            (out of 20)
        1                 13                 .65    .35    .23
        2                 18                 .90    .10    .09
        3                 16                 .80    .20    .16
        4                  8                 .40    .60    .24
        5                 12                 .60    .40    .24
        6                 15                 .75    .25    .19
        7                 20                1.00    .00    .00
       etc.
                                                   Σpq = 9.74

   Compute p, the proportion of the group passing the item.
   Determine q, the proportion of the group not passing, by subtracting p
   from 1.00.
   Multiply p by q, and enter in the pq column.
   Add this column to get Σpq.

4. Note n, the number of items, and substitute in the formula:

       n = 60

   $$ r = \frac{n}{n-1}\left(1 - \frac{\sum pq}{s^2}\right) $$

       r = 1.01 (1 - .254)
       r = 1.01 (.746) = .76

The resulting coefficient is a serious underestimate if the test is not a
measure of a single general ability. The coefficient is a spuriously high
estimate if the test is a speed test.

Computing Guide 7. Determining the reliability coefficient by the
Kuder-Richardson method (Case III).
abilities, the estimate from this formula is low, sometimes much too low.
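The Kuder-Richardson computation of Computing Guide 7 can likewise be sketched in Python. The function names are hypothetical, and the form of the formula, r = (n / (n - 1)) (1 - Σpq / s²), is the commonly cited Kuder-Richardson formula 20, which is assumed here to be the one the guide presents. The figures n = 60, s = 6.2, and Σpq = 9.74 are taken from the guide. The helper that starts from raw responses uses the population standard deviation, which is an assumption about how s is computed.

```python
from statistics import pstdev


def kuder_richardson_20(n_items, sum_pq, sd_total):
    """r = (n / (n - 1)) * (1 - sum(pq) / s^2)."""
    return (n_items / (n_items - 1)) * (1.0 - sum_pq / sd_total ** 2)


def kr20_from_responses(responses):
    """responses: one list per person, 1 for a pass and 0 for a failure."""
    n_people = len(responses)
    n_items = len(responses[0])
    totals = [sum(person) for person in responses]
    sum_pq = 0.0
    for item in range(n_items):
        p = sum(person[item] for person in responses) / n_people
        sum_pq += p * (1.0 - p)  # q = 1 - p
    return kuder_richardson_20(n_items, sum_pq, pstdev(totals))


# Figures from Computing Guide 7: 60 items, s = 6.2, sum of pq = 9.74.
r = kuder_richardson_20(60, 9.74, 6.2)
print(round(r, 2))  # 0.76, agreeing with the guide
```

The sketch does not guard against a group in which every person earns the same total score; in that case s is zero and the coefficient is undefined.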

The coefficient of equivalence may also be estimated by the parallel-test 
method, correlating scores on two forms of the test given at the same time. 
This shows directly whether a different sampling of items would give the 
same picture of individual differences. 

The coefficient of stability shows the extent to which scores on the par-
ticular test items are stable over a period of time. It indicates whether a
sample of behavior taken at one time is typical of behavior at other times. It
is estimated by the retest method, the same persons being given the same
test on two different occasions, having had no opportunity between trials to
study the material covered by the test. The retest method may contain a 
spurious factor due to carry-over from one trial to another. This is especially 
troublesome in tests of reasoning, for example, where the person may re- 
member the solution to a problem so that it no longer tests his reasoning. 

The coefficient of stability and equivalence measures fluctuations from 
day-to-day changes in the person and fluctuations due to the particular 
choice of items in the test. It shows the extent to which the test measures 
stable individual differences in the variables present both in the test and in 
its parallel form. It is estimated by the delayed parallel-test method. Two 
tests containing different questions, but supposedly equivalent, are given to 
the student on two occasions. The correlation between the two tests is taken 
as the coefficient. 
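The retest, parallel-test, and delayed parallel-test methods all reduce to correlating two sets of scores for the same persons. A minimal sketch follows, assuming that the product-moment (Pearson) correlation is the one intended; the function name and the five pairs of scores are hypothetical and are not data from the text.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two paired lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


# Hypothetical scores for five persons (invented for illustration).
form_a_first_occasion = [41, 52, 52, 34, 45]
form_b_later_occasion = [43, 50, 47, 36, 49]
r = pearson_r(form_a_first_occasion, form_b_later_occasion)
print(round(r, 2))
```

Which coefficient the result represents depends only on how the two score lists were obtained: the same form after an interval gives a coefficient of stability, a second form at the same sitting gives a coefficient of equivalence, and a second form after an interval gives a coefficient of stability and equivalence.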

Which method of estimating the amount of error in test scores is most suit- 
able depends on the situation. In a speeded test, it is impossible to determine 
a coefficient of equivalence save by giving an immediate parallel test. The 
split-half and Kuder-Richardson methods must not be used. When this prin- 
ciple is disregarded, as it often is, and split-half methods are applied to speed 
tests, the resulting coefficients are grossly inflated. To show the size of this 
error, authors of one test computed the coefficients shown in Table 6, first 
by the proper parallel-test method, second by the inappropriate split-half 
method. The second set of coefficients is much higher. Some test manuals 
report only such spurious coefficients— the estimates are in error and give no 
useful information about the accuracy of the test. 

If one considers that day-to-day variation is a source of error (as when 
trying to measure “intelligence,” which is supposed to be constant), he will
use a coefficient of stability or stability and equivalence. This is the case in 
most practical applications of tests. If the day-to-day fluctuations are con-
sidered to be “real” variables rather than error, a coefficient of equivalence
is used.

The Detroit Beginning First-Grade Intelligence Test provides an example
of both types of coefficient. The test is unspeeded, and may properly be split.
The split-half coefficient is .91, which shows that the test is an accurate
measure of beginning pupils. But when the retest method is used, the
coefficient drops to .76 (correlation between scores at start of year and
scores just four

Table 6. Inflation of Reliability Estimates When Split-Half
Method Is Applied to a Speeded Test* (6, p. C-6)

          Reliability Properly         Estimates by Split-Half Method
          Determined by Correlating    Obtained to Demonstrate Spurious Increase
Grade     Two Forms                    Form A          Form B
  8            .77                      .990            .996
  9            .83                      .991            .989
 10            .93                      .996            .985
 11            .86                      .992            .993
 12            .92                      .996            .969

* One of the Differential Aptitude Tests.


months later). So pupils placed in a “fast” group or a “slow” group at the 
start of the year might shift in ability so rapidly that such a classification 
would be quite unjust a few months later. The test is accurate enough to 
measure what can be expected of pupils in the first weeks of school, but
teachers must know that scores are unstable and that some pupils will 
change a great deal before the term ends. 

The procedures for studying reliability are summarized in Table 7. 


Table 7. Summary of Characteristics of Three Reliability Coefficients

Coefficient of equivalence
    Questions answered by the coefficient: How precisely does the test
    measure? How adequately does it sample all the items that might be
    included?

    Method of determination: Split-half
        Assumptions: Halves must be equivalent.
        Error when assumptions are violated: Coefficient for speed tests
        falsely high. For other tests, coefficient is too low if halves are
        not equivalent.

    Method of determination: Kuder-Richardson
        Assumptions: Test must measure a single factor.
        Error when assumptions are violated: Coefficient for speed tests
        falsely high. For other tests, coefficient is too low if items
        measure many factors.

    Method of determination: Immediate parallel test
        Assumptions: Tests must be equivalent.
        Error when assumptions are violated: Coefficient shows degree of
        equivalence, rather than accuracy of either test.

Coefficient of stability
    Questions answered by the coefficient: How stable is measurement with
    the test?

    Method of determination: Retest after an interval
        Assumptions: No opportunity for increasing ability by practice
        during interval.
        Error when assumptions are violated: Practice on function tested
        reduces coefficient.

Coefficient of stability and equivalence
    Questions answered by the coefficient: How would the sample of behavior
    at one time correspond to results from a similar sample at another time?

    Method of determination: Parallel test after an interval
        Assumptions: Tests must be equivalent. No opportunity for practice.
        Error when assumptions are violated: Coefficient underestimates
        accuracy of test if assumptions are violated.

29. A test of 30 items is given to 20 students, with the results shown below. Each
+ indicates a correct answer, and each 0 an incorrect answer. Determine the
coefficient of equivalence by the split-half method (odd-even) and by the
Kuder-Richardson formula.

30. A teacher gives a standardized test of knowledge of scientific facts to her class 
in chemistry. Several students make scores lower than she had expected. 

a. She asks herself, “Could it be that I gave a form of the test which included
many questions these particular pupils happened not to know? Would their
scores have changed much if they had been asked other questions of the
same type?” Which reliability coefficient is appropriate to answer this
question?

b. Suppose she asks, “Could the performance of these students be due to the
fact that they were having an off day? Does a pupil’s score on tests of this 
type vary much from day to day?” Which coefficient is most helpful in 
answering this question? 

31. Which of the factors in Table 5 would lower the correlation between test and 
criterion in each of the following situations? 

a. A test of high-jumping ability is used to select finalists in a track meet. The 
criterion is performance in the meet, two weeks after the trials. 

b. A pencil-and-paper test of mechanical ability is used to predict perform- 
ance of mechanical trainees. Piecework earnings after training are used as 
the criterion. 

32. The following quotation refers to hearing tests for children (29, p. 225) : 

“It is certainly understood that physical and psychological changes taking
place from day to day may make tests at two sittings less valid than a
complete test at one sitting. Dr. Gardner observes empirically that he gets 
worse results on cloudy days than on sunny days.” 

In what sense is the word “valid” used? Can you defend the contrary state- 
ment that scores at two sittings would be more valid than a complete test at 
one sitting? 


Table 8. Representative Reliability Coefficients

Test                               Method                       Coefficient   Sample
CEEB Scholastic Aptitude, Verbal   Odd-even                     .96           500 applicants to colleges (11, pp. 29 ff.)
Kuder Preference Record            Retest (3 days)              .93 to .98    41 graduate students (51)
  (interests)
Adaptability Test (mental          Parallel test, immediate     .89
  ability)
CEEB Scholastic Aptitude, Verbal   Parallel test, varied time   .87           7396 applicants to colleges (11, p. 60)
Bernreuter, “self-confidence”      Odd-even                     .86           100 adolescent boys (22)
  score
Kuhlmann-Anderson (IQ)             Parallel test (2½ yrs.)      .69           327 children in Grade 1 at first test (1)
Fels Parent Behavior Scales        Two home visitors, one       .61           Not reported (3)
  (ratings)                          year apart
Tests of cheating                  Similar test (12 years)      .37           About 150 children in Grades VII-VIII at
                                                                                first test (33)


There is no single standard of what is an adequate reliability coefficient. 
Few tests approach perfect reliability. Sometimes short and unreliable tests
are valuable for particular purposes, especially for making rapid judgments.
Generally, higher reliability coefficients are obtained for ability tests than 
for measures of typical performance. Reliability coefficients reported for sev- 
eral tests are presented in Table 8. 

The degree of reliability required varies with the purpose of the test. As 
with validity, the reliability desired is the highest we can get. Validity is the 
goal in testing. The maximum validity a test can have is the square root of 
the reliability coefficient. 
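The ceiling that reliability places on validity can be illustrated with a short Python sketch. The function name is hypothetical; the three reliability values are taken from Table 8 purely as examples.

```python
from math import sqrt


def max_validity(reliability):
    """Upper bound on a validity coefficient: the square root of reliability."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sqrt(reliability)


# Reliability coefficients of .37, .69, and .96, as reported in Table 8.
for r in (0.37, 0.69, 0.96):
    print(r, round(max_validity(r), 2))
```

A test with a reliability of .69, for instance, cannot correlate more than about .83 with any criterion, however well chosen that criterion may be.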


Objectivity 

In general, objective tests are to be preferred to those in which the opinion 
of the scorer influences the results, but subjectively scored tests are at times
more flexible or better suited to a given purpose. This flexibility is especially 
advantageous in clinical work. Whenever one permits opinion to affect a test 
score, he makes it impossible to compare his results with those of other test- 
ers, and difficult to compare his subject with the normal person. Objectivity 
is always to be desired if it can be obtained without sacrificing the major 
purpose of the test.

Current practice in group testing favors the multiple-choice form. Although
this measures recognition of correct answers rather than recall, it is
satisfactory for many purposes. The College Entrance Examination Board,
for example, reports that in a mathematics achievement test at the college
level multiple-choice questions had reliability coefficients and correlation
with grades in later mathematics essentially the same as those for
free-answer questions (11, p. 61).

American practice, with tests designed to be used by different persons
with the same results, is in interesting contrast to work in other countries,
for example the Japanese Army procedures for selecting pilots. In addition
to mental and psychomotor tests, an oral examination was given. Subjective
judgment was used in evaluating test results. In the oral interview, an effort
was made to avoid standardizing the questions and general approach, so as 
to observe applicants flexibly. Decision to admit the applicant to flying was 
based on an “artistic” judgment by the psychologist, rather than a numerical 
treatment of scores. Typical validity figures, using training records as a cri- 
terion, were reported as .40 to .65. “The validity coefficient for the final 
assessment was qualified by adding that this value was secured ‘if Mr. 
Yoshino does it’ (makes the final assessment). . . . Other ‘experts’ were not
as good” (24). A similar impression was obtained by Fitts, who studied 
German psychological testing in World War II (21, p. 153) : 

“A marked characteristic of the German testing procedure was the emphasis on
subjective evaluation of test data. German psychologists stated repeatedly that
observations of the candidate’s behavior during a test were more important than
the actual score which he earned. . . . One man . . . said that the chief fault
of inexperienced military psychologists was that they attached too much weight
to objective scores and did not pay enough attention to the formation of an
intuitive impression from observation of the candidate’s reactions and
expressions. Individual examiners were permitted and often encouraged to vary
testing procedures and to emphasize their favorite tests.”

Even in types of tests in which the subject must produce a free response, 
methods can be devised which permit fairly objective scoring. Ayres, in 
1912, produced a guide for scoring pupil handwriting (Fig. 16). Samples 
of handwriting representing various levels of quality are given; the teacher 
compares the pupil’s work to find the most similar sample (2). Product- 
rating scales have been used frequently in judging quality of cooking, shop- 
work, and other skilled work. Objective methods have not been completely 
successful in scoring free-response tests, but variation between scorers is
reduced by guides which show the approved scoring for representative
answers. Noteworthy examples are the scoring manuals for the Stanford-Binet
test of intelligence (45, 38) and the volume by Beck (5) on the Rorschach
test of personality.

Fig. 16. Part of Ayres’ scale for scoring handwriting samples. (Courtesy
Russell Sage Foundation, publisher.)

The importance of eliminating variation between scorers was established 
rather dramatically in a series of studies by Starch and Elliott (43, 44). They 
requested a large number of teachers to score a pupil’s composition; the 
grades ranged from 50 percent to 98 percent. Teachers use different bases 
for judgment, and penalties of varying severity. Even in geometry, teachers
gave grades ranging from 28 to 92 to the same paper. These results are 
extreme, since the teachers were given no instructions for grading, but they 
demonstrate that results will not be comparable unless scoring is
standardized.

33. The question “Why should people wash their clothing?” is to be used in an
oral intelligence test for adults, to test comprehension of common situations. 
Provide a set of scoring standards to indicate what should be considered cor- 
rect. Make your list of characteristics of a satisfactory answer so clear that 
other scorers would be able to agree in scoring new answers. 
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Norms 

The authors of most tests provide tables for interpreting a raw score in
relation to “normal” performance. These tables are based on computations
from scores on a norm group, or standardizing group. The data are usually 
presented as means, percentiles, or standard scores, so that the test user, 
knowing the score of a person, can determine how he compares with the 
standard group. Age norms indicate the average performance of persons of 
each age. Grade norms indicate the average performance of children in each 
grade. For personality tests and tests of mechanical ability, separate norms
for men and women are usually needed, since their averages are different. 

If a test is to be used only to establish individual differences within a 
group, norms are unimportant. Norms are of little use to the employment
manager who wishes to hire the brightest 50 percent of a group of applicants.
Norms are also of little value where a critical score is used (see Chapter 11).
If a personnel manager knows from actual trial that persons with scores of 72 
on test A make satisfactory punch-press operators, it is not necessary for him 
to compare applicants with national norms. Where only a single person is 
being tested, as in guidance and clinical work, it is extremely important
that scores be interpreted in relation to good norms. A score which seems,
at first glance, to be high may turn out to be only average when compared 
to a normal group of scores. 
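How the same raw score takes on different meanings against different norm groups can be shown with a brief Python sketch. Everything here is illustrative: the function names are hypothetical, both norm groups are invented, and counting tied scores as half below is one common convention assumed for the example rather than a rule given in the text.

```python
from statistics import mean, pstdev


def percentile_rank(score, norm_scores):
    """Percent of the norm group scoring below, counting ties as half."""
    below = sum(1 for s in norm_scores if s < score)
    ties = sum(1 for s in norm_scores if s == score)
    return 100.0 * (below + 0.5 * ties) / len(norm_scores)


def standard_score(score, norm_scores):
    """Distance from the norm-group mean in standard-deviation units."""
    return (score - mean(norm_scores)) / pstdev(norm_scores)


# Hypothetical norm groups (invented for illustration only).
general_group = [48, 52, 55, 60, 61, 63, 66, 70, 72, 75]
select_group = [66, 70, 72, 74, 75, 77, 79, 81, 84, 88]

print(percentile_rank(72, general_group))  # 85.0
print(percentile_rank(72, select_group))   # 25.0
```

A raw score of 72 stands well above the middle of the first group and well below the middle of the second, which is why the choice of norm group matters so much when a single person is being interpreted.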

Whether norms are satisfactory depends on three questions: (1) Are the 


Table 9. Standardizing Samples Used by Authors of Representative Tests

Test                    Type                       Norms                           Number of Cases
Wechsler-Bellevue       Intelligence, individual   Age; IQ                         1751
    Nature of cases: White urban adults representative of all occupational
    levels; school children from normal and retarded groups
Henmon-Nelson           Intelligence, group        Percentiles by college years    5500
    Nature of cases: Students of colleges which admit high-school graduates
    without further selection
Terman-McNemar          Intelligence, group        Age; percentile                 About 19,000
    Nature of cases: Random sample of pupils in 148 communities in 33 states
    plus 307 parochial schools near Philadelphia
Bell (adult)            Adjustment                 Mean and sigma by sex           468
    Nature of cases: Adults from adult extension classes, YMCA counseling
    service, etc.; Los Angeles, Chicago, Boston, Madison, N.J.
Bernreuter              Personality                Percentile by sex, separately   121 to 651 per group
                                                   for h.s., coll., adult
    Nature of cases: Details not reported
Minnesota Multiphasic   Personality                Standard score by               About 700
    Nature of cases: Healthy adult visitors to a Minnesota hospital
Progressive             Achievement, elementary    Percentiles by half-grades      Over 25,000
    Nature of cases: School children in many states





norms based on a sufficiently large group? (2) Is the standard group
representative? (3) Does the standard group resemble the persons with whom
we wish to compare our subject? The first two questions are general tests of
the adequacy of published norms; the third depends on the immediate use
intended. Table 9 lists the bases on which norms for several representative
tests were established. The numbers of cases used by these authors differ
greatly. If too few cases are used, the norms may be inaccurate owing to
accidental selection of good or poor cases. With more cases, such variations
should cancel out. No fixed number of cases is required for dependable
norms; representativeness is more important than accumulating large num- 
bers of cases which may not be representative. 



Fig. 17. The importance of local norms. Shaded bars show score distribution
for 40,299 freshmen in 203 colleges, according to national norms. Unshaded
bars show distribution for 646 freshmen at the University of Chicago.
(Data from 49, p. 16.) [Horizontal axis: raw score, ACE Psychological
Examination, 1933, in the intervals 0-169, 170-201, 202-232, 233-264, and
265 and over.]

The norm group is representative if the persons included are typical of
those on whom the test is to be used. If the sample is biased so that
all types of people are not represented equally, the norms are of little value.
If a test is intended for college students, it is fair to base norms on college
students; but if the test is to be used with adults generally, the average of
the college group would be an unfair standard of comparison. If college
norms are to be used, it is important that all types of colleges be represented,
that all geographical areas be represented, and that all types of students
within the college be represented. Occasionally a test author makes the error
of accumulating a large number of cases of the sort most readily available,
and reports, perhaps, “The norms are based on scores of 2700 sophomores
taking general psychology at four Western colleges.” Norms such as these are
useless unless the tester wishes to know how his cases compare with sopho- 
more psychology students at Western colleges. 

Even though a test has carefully prepared norms, the norms may not be 
adequate for a particular use. Norms for the Wechsler-Bellevue test of men- 
tal ability give helpful information about adults in general. But a boy who is 
above the average score for his age, when compared to people in general,
may be below average among college freshmen. If we wish to predict 
whether he can succeed in college, we need to compare his score with 
Wechsler-Bellevue norms based on college students alone. Local norms are 
needed to predict success in a particular college (cf. Fig. 17). The Kuder 
Preference Record (1946 edition) provides norms based on a large group of
high-school boys and girls. The interpretation of interest profiles of col- 
lege freshmen, based on these norms, is complicated by the fact that non- 
academic high-school pupils drop out before reaching college so that the 
average college student probably has an “unaverage” profile. College norms 
for the test are in preparation. Local norms may be more useful than national
norms for many problems. Sections of the country, schools, and ethnic groups 
vary widely in America. A person who is below the norm in a superior group 
may be able to compete successfully in a poorer group. If we wish to predict 
his standing, we must compare him to those with whom he associates.

Before we compare a pupil with a set of norms, we must consider the 
educational policies of the schools where the norms were obtained. In a 
school where some children enter before reaching the age of six, the reading 
and vocabulary norms for nine-year-olds will be relatively high, because
some pupils will have had more schooling than nine-year-olds elsewhere who 
could not enter school until age six. In a school which promotes every child 
every year, the average performance of fourth-graders is likely to be lower 
than in a school which holds back the poorer students. 

34. Discuss the representativeness of the groups used in standardizing each of the 
tests in Table 9. 

35. In the freshman class of a particular college which admits all high-school
graduates who apply, the median score is at the 65th percentile of the published
norms for freshmen for the Henmon-Nelson test. What factors might account 
for this deviation? 

36. Outline detailed plans for selecting a suitable norm group for a test of aptitude 
for engineering, to be used for vocational guidance in high schools and col- 
leges. 

37. A psychologist standardizes a primary intelligence test by testing every child 
entering the first grade in San Francisco during a particular year.

a. For what purposes would these norms be valuable? 

b. Could equally satisfactory norms be obtained without testing every
first-grader?

38. A “music aptitude” test measures such factors as tone discrimination. There is
evidence that scores are increased by musical training. If the test is to be used 
for advising college freshmen whether to study music, what sort of cases 
should be used to establish national norms (25)? 

39. How would you proceed to get an extremely representative sample of adult 
men in Chicago to use as a standardizing group for a mental test? Assume that
you have sufficient research funds to pay each man $1.00 for taking the test. 

Good Items 

In creating a test, the test author tries to select items which will give re- 
liable and valid results. Although the test user has no control over the test 
items, knowledge of the principles of item construction may help him to 
evaluate tests. 

Good test items are unambiguous (with the exception of some personality 
tests where ambiguity is deliberate). There should be only one possible in- 
terpretation of the question, and if choices are offered in an ability test, com- 
petent judges should agree that there is only one acceptable answer. Ques- 
tions should be stated so that the person knows what is wanted; all failures 
will then depend upon inability to answer, rather than lack of understanding 
of the tester’s desires. “Catch questions” are undesirable.


Table 10. Percentages of Bright and Dull Children Passing
Five Proposed Items for an Intelligence Test*

                                                      Percent Passing,   Percent Passing,
Item                                                  IQ Below 96        IQ Above 105
1. Counting backward from 20 to 1 (Year VIII)               35                 83
2. Drawing from observation an apple with a
   pencil through it (Year VIII)                            62                 57
3. Repeat six digits (Year X)                               55                 81
4. Binet’s suggestion test, 1911 revision (Year X)          80                 89
5. Drawing simple designs from memory (Year X)              45                 70

* Data obtained by Terman in preparing his 1916 intelligence test (46,
pp. 133-134). Percentages were computed for each test on children of the
age at which the test is located.


Good test items measure what the tester wants to measure. This is the 
principal factor in logical validity of a test. With care the test maker can 
“purify” his test considerably. One important method of removing test items 
loaded with irrelevant factors is the internal-consistency test. If an item 
measures what the remainder of the test does, it should have a high
correlation with the total test. A simple way of studying the item-test
relationship is to select, in any group, a number of persons with high total
scores, and another group with low total scores. It is customary to compare
the top fourth with the lowest fourth of the group. The percentage of each
group giving each answer is studied; if the item measures what the rest of the
test measures, far more “good” students than “poor” ones should get the item
correct. An item which does not discriminate (i.e., where both groups are
equally often correct) does not measure what the test measures. In Table 10,
discrimination data for items of an intelligence test are shown. From these data,
item 1 is excellent, but item 2 is quite unsatisfactory. 
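The comparison of high and low groups can be expressed as a simple difference in percent passing. The sketch below is illustrative only: the function names are hypothetical, and the difference in percentages is used as a convenient index, not as a statistic the text prescribes. The first function works from percentages such as those for items 1 and 2 in Table 10; the second forms top and bottom groups from raw responses ranked by total score.

```python
def discrimination(percent_high, percent_low):
    """Difference in percent passing between the high and low groups."""
    return percent_high - percent_low


def upper_lower_discrimination(responses, item, fraction=0.25):
    """Percent passing `item` in the top group minus percent in the bottom group.

    responses: one list per person, 1 for a pass and 0 for a failure.
    Groups are the top and bottom `fraction` of persons by total score.
    """
    ranked = sorted(responses, key=sum)
    k = max(1, int(len(ranked) * fraction))
    low, high = ranked[:k], ranked[-k:]

    def percent_passing(group):
        return 100.0 * sum(person[item] for person in group) / len(group)

    return percent_passing(high) - percent_passing(low)


# Items 1 and 2 of Table 10: (percent passing, IQ above 105; IQ below 96).
print(discrimination(83, 35))  # 48: the brighter group passes far more often
print(discrimination(57, 62))  # -5: the item fails to separate the groups
```

A large positive difference indicates that the item ranks persons in the same direction as the rest of the test; a difference near zero or below indicates that it does not.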

An even better procedure for selecting items is to test each item against a 
criterion of what the test is desired to measure. If such a criterion is avail- 
able, the persons scoring high and low on the criterion are compared on
every item. 

Good items should have difficulty appropriate to the group tested. The best
tests for measuring all levels within a group are those in which the average 
item difficulty is near 50 percent. Sometimes the test is made easier, so that 
subjects will not be discouraged by frequent failure. The items may range 
from a few very difficult ones to a few very easy ones. Items which practically 
no one passes are of little value, since they do not tell much about individual 
differences. Items which everyone passes give no information about differ- 
ences; a few such items may be desirable to build morale and give the sub- 
ject a “running start” in the test. Very difficult items are valuable to distin-
guish between superior persons, and easy items discriminate among people 
far below average. 
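One way to see why items of middle difficulty tell the most about a whole group is to count how many pairs of persons an item separates: each person who passes is distinguished from each person who fails. The counting argument and the function name below are an illustration added under that assumption, not a derivation given in the text; the group of 20 persons is hypothetical.

```python
def pairs_distinguished(n_passing, n_total):
    """Number of person pairs (one passing, one failing) the item separates."""
    return n_passing * (n_total - n_passing)


# A hypothetical group of 20 persons, with items of varying difficulty.
for n_pass in (0, 5, 10, 15, 20):
    print(n_pass, pairs_distinguished(n_pass, 20))
```

The count is greatest when half the group passes and falls to zero when everyone passes or everyone fails, which matches the statement that such items give no information about individual differences.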

40. In Table 10, are items 3, 4, and 5 good or poor, by the internal-consistency
test? 

41. Determine the discriminating power of items 1-10 from the data in Problem 
29. Since few cases are available, compare the better half of the subjects to 
the poorer half. 

SOURCES OF INFORMATION ABOUT TESTS 
When confronted with a practical task, the tester must determine what 
tests are available and must select among them on the basis of the criteria 
just given. Accurate information about tests is therefore essential. For obtain- 
ing the names of tests of a particular type, catalogs issued by various pub- 
lishers and distributors are helpful. These catalogs usually give brief descrip- 
tions of the test, and its cost. Many of the leading suppliers of tests are listed 
in Appendix B. Another important source of test listings is Hildreth’s
Bibliography of mental tests and rating scales (2nd ed.), published by the
Psychological Corporation in 1939. A supplement was issued in 1945.

The manual of the test should provide the prospective user with the facts 
needed to judge its suitability. A satisfactory manual presents evidence of 
validity and reliability, describes the method of obtaining norms, and pro- 
vides a basis for deciding what the test measures and how to interpret and 
use scores. Unfortunately many manuals do not give all these facts. When the 
facts are given, they are often given without comment, leaving the reader to 
decide whether the norm sample was adequate, whether the criterion for the 
validity study was suitable, and so on. Because considerable experience is 
required to judge the quality of a test with certainty, the user frequently
wishes an expert opinion on the test. Buros’ Mental measurements yearbook
(9, 10) is the best single guide for the consumer of tests. Buros has enlisted 
the services of numerous specialists in testing, each of whom reviews tests 
with which he is familiar. The reviews are thoroughly critical, stressing dif- 
ferences between tests and limitations which must be considered in inter- 
preting test scores. Each test is reviewed by several specialists wherever 
possible. Editions of the yearbook were published in 1938, 1941, and 1948; 
it is planned as a continuing project. 

A few scattered sources also provide facts about tests of a particular type. 
Test users may wish to consult the following: 

Review of educational research, “Psychological tests and their uses.” National 
Education Association. Reviews new developments in testing over a three-year 
period; this feature is contained in the issues of February, 1941, February, 1944, 
February, 1947, and earlier. 

Bennett, George K., and Cruikshank, Ruth N. A summary of manual and mechani- 
cal ability tests. Psychological Corporation, 1942. Reviews experimental studies 
on prominent tests in this field, and gives brief evaluations. 

Moore, Herbert. Experience with employment tests. Studies in Personnel Policy 
No. 32. National Industrial Conference Board, 1941. Reviews tests of all types 
with emphasis on their significance for employment. 

Borow, Henry, and Whittaker, Betty A. Tests in industry. Personnel Service Divi- 
sion, Pennsylvania State College, 1943. 

Kaplan, Oscar N. (ed.). Encyclopaedia of vocational guidance. Philosophical 
Library, 1947. Contains descriptions of many tests, usually written by one of the 
test authors. 

42. How acceptable is the point of view expressed in the following (quoted in 4)? 

“It is difficult to construct short tests that will be statistically reliable and 
valid. On the other hand, it is also difficult to validate a test that would be 
satisfactory to a psychologist while personnel work remains at its present 
level. It seemed logical, therefore, that it would be better to lean in the 
direction of the personnel supervisor’s concept of a good test (short, no 
timing, self-administering, self-scoring, self-rating, easily interpreted, and 
practical) so that he would accept and use tests and thereby improve his 
personnel program sufficiently to make available better criteria by which 
tests can be validated.” 

43. List the facts which you would expect to be reported in the manual for a group 
test of intelligence which would help you decide whether to use it or a com- 
peting test. 

44. “The time for each test (Stenquist Mech. Apt. I and II) is forty-five minutes, 
the reliability ranges between .67 and .84, and the validity between .66 and 
.84.” How must this statement from a test review be supplemented to make it 
meaningful? 

45. Two reviews are given in Buros’ yearbook (10) for the California Capacity 
Questionnaire. Compare the reviews and attempt to decide why competent 
psychologists disagree in their evaluation of the test. 

SUGGESTED READINGS 

Bingham, W. V. “Selection of tests,” Aptitudes and aptitude testing. New York: 
Harper, 1937, pp. 209-223. 




Blommers, Paul, and Lindquist, E. F. “Rate of comprehension of reading: its 
measurement and its relation to comprehension.” J. educ. Psychol., 1944, 35, 
449-473. 

Cronbach, Lee J. “Response sets and test validity.” Educ. psychol. Measmt., 1946, 
6, 475-494. 

Ghiselli, Edwin E., Harrell, T. W., and others. Reviews of Pennsylvania Bi-Manual 
Worksample, in Buros, O. K. (ed.). The third mental measurements yearbook. 
New Brunswick: Rutgers Univ. Press, 1948, pp. 695-698. 

Guilford, J. P. “New standards for test evaluation.” Educ. psychol. Measmt., 1946, 
6, 427-438. 

Jenkins, John G. “Validity for what?” J. consult. Psychol., 1946, 10, 93-98. 

Kent, Grace H. “The ‘Andover’ school-entrance test.” J. educ. Psychol., 1944, 35, 
108-119. 

Terman, Lewis M., and Merrill, Maud A. “Development and standardization of the 
scales: the selection of subjects,” in Measuring intelligence. Boston: Houghton 
Mifflin, 1937, pp. 12-21. 

Thorndike, Robert L. (ed.). “Problems associated with reliability and reliability 
determination,” in Research problems and techniques. AAF Aviation Psychol. 
Prog. Res. Rep. No. 3. Washington: Govt. Printing Off., 1947, pp. 97-118. 

REFERENCES 

1. Allen, Mildred M. “Relationship between indices of intelligence derived from 
the Kuhlmann-Anderson intelligence tests for grade I and the same test for 
grade IV.” J. educ. Psychol., 1945, 36, 252-256. 

2. Ayres, L. P. A scale for measuring the quality of handwriting of school 
children. New York: Russell Sage Foundation, 1912. 

3. Baldwin, Alfred L. “The appraisal of parent behavior.” Amer. Psychologist, 
1946, 1, 251. 

4. Baxter, Brent. “Reliability and validity of the Kopas Wage Earner Battery of 
Tests.” J. appl. Psychol., 1947, 31, 39-43. 

5. Beck, Samuel J. Rorschach’s test. I: Basic processes. New York: Grune and 
Stratton, 1944. 

6. Bennett, George K., and others. Differential abilities tests, manual. New York: 
Psychological Corporation, 1947. 

7. Blommers, Paul, and Lindquist, E. F. “Rate of comprehension of reading: its 
measurement and its relation to comprehension.” J. educ. Psychol., 1944, 35, 
449-473. 

8. Bloom, Benjamin S. “Test reliability for what?” J. educ. Psychol., 1942, 33, 
517-526. 

9. Buros, Oscar K. (ed.). The 1940 mental measurements yearbook. Highland 
Park, N.J.: The Mental Measurements Yearbook, 1941. 

10. Buros, Oscar K. (ed.). The third mental measurements yearbook. New 
Brunswick: Rutgers Univ. Press, 1948. 

11. College Entrance Examination Board. 46th annual report of the executive 
secretary. New York: College Entrance Examination Board, 1946. 

12. Cronbach, Lee J. “Studies of acquiescence as a factor in the true-false test.” 
J. educ. Psychol., 1942, 33, 401-415. 

13. Cronbach, Lee J. “On estimates of test reliability.” J. educ. Psychol., 1943, 34, 
485-494. 

14. Cronbach, Lee J. “A case study of the split-half reliability coefficient.” J. educ. 
Psychol., 1946, 37, 473-480. 





15. Cronbach, Lee J. “Response sets and test validity.” Educ. psychol. Measmt., 
1946, 6, 475-494. 

16. Cronbach, Lee J. “Test ‘reliability’: its meaning and determination.” 
Psychometrika, 1947, 12, 1-16. 

17. Davis, C. Jane. “Correlation between scores on Ortho-Rater tests and clinical 
tests.” J. appl. Psychol., 1946, 30, 596-603. 

18. DuBois, Philip H. (ed.). The classification program. AAF Aviation Psychol. 
Prog. Res. Rep. No. 2. Washington: Govt. Printing Off., 1947. 

19. Eckelberry, R. H., in book review. Educ. Res. Bull., 1947, 26, 138-139. 

20. Ferson, M. L., and Stoddard, G. D. “Law aptitude.” Amer. Law Sch. Rev., 
1927, 6, 78-81. 

21. Fitts, Paul M. “German applied psychology during World War II.” Amer. 
Psychologist, 1946, 1, 151-161. 

22. Flanagan, John C. Factor analysis in the study of personality. Stanford Univ.: 
Stanford Univ. Press, 1935. 

23. Ford, Adelbert, and others. The sonar pitch memory test: a report on design 
standards. San Diego: Univ. of Calif. Div. of War Res., 1944. 

24. Geldard, Frank A., and Harris, Chester W. “Selection and classification of 
aircrew by the Japanese.” Amer. Psychologist, 1946, 1, 205-217. 

25. Gilbert, G. M. “‘Aptitude’ and training: a suggested restandardization of the 
K-D music test norms.” J. appl. Psychol., 1941, 25, 326-330. 

26. Guilford, J. P. “New standards for test evaluation.” Educ. psychol. Measmt., 
1946, 6, 427-438. 

27. Guilford, J. P. (ed.). Printed classification tests. AAF Aviation Psychol. Prog. 
Res. Rep. No. 5. Washington: Govt. Printing Off., 1947. 

28. Guttman, L. “A basis for analyzing test-retest reliability.” Psychometrika, 
1945, 10, 255-282. 

29. Henry, Sibyl. “Children’s audiograms in relation to reading attainment: I. 
Introduction to and investigation of the problem.” J. genet. Psychol., 1947, 70, 
211-232. 

30. Highsmith, J. A. “Selecting musical talent.” J. appl. Psychol., 1929, 13, 
486-493. 

31. Hildreth, H. M. “Single-item tests for psychometric screening.” J. appl. 
Psychol., 1945, 29, 262-267. 

32. Huston, Paul E., and Shakow, David. “Studies of motor function in 
schizophrenia: III. Steadiness.” J. gen. Psychol., 1946, 34, 119-126. 

33. Jones, Vernon. “A comparison of certain measures of honesty at early 
adolescence with honesty in adulthood: a follow-up study.” Amer. Psychologist, 
1946, 1, 261. 

34. Leaf, C. T. “Prediction of college marks.” J. exp. Educ., 1940, 8, 303-307. 

35. Lindner, Robert M., and Gurvitz, Milton. “Restandardization of the Revised 
Beta Examination to yield the Wechsler type of IQ.” J. appl. Psychol., 1946, 
30, 649-658. 

36. Louttit, C. M., and Browne, C. G. “The use of psychometric instruments in 
psychological clinics.” J. consult. Psychol., 1947, 11, 49-54. 

37. Mosier, Charles I. “A critical examination of the concepts of face validity.” 
Educ. psychol. Measmt., 1947, 7, 191-206. 

38. Pintner, Rudolf, and others. “Supplementary guide for the Revised 
Stanford-Binet Scale (Form L).” Appl. Psychol. Monogr., 1944, No. 3. 

39. Sartain, A. Q. “Relation between scores on certain standard tests and 
supervisory success in an aircraft factory.” J. appl. Psychol., 1946, 30, 328-332. 

40. Scates, Douglas E. “Unit costs in the administration of a standardized test.” 
Educ. Res. Bull., 1937, 16, 38-45. 





41. Shuman, John T. “The value of aptitude tests for supervisory workers in the 
aircraft engine and propeller industries.” J. appl. Psychol., 1945, 29, 185-190. 

42. Smith, Leo F., and Hogadone, E. B. “Some evidence on the validity of the 
Cardall Test of Practical Judgment.” J. appl. Psychol., 1947, 31, 54-56. 

43. Starch, Daniel, and Elliott, E. C. “Reliability of grading high school work in 
English.” Sch. Rev., 1912, 20, 442-457. 

44. Starch, Daniel, and Elliott, E. C. “Reliability of grading high school work in 
mathematics.” Sch. Rev., 1913, 21, 254-259. 

45. Terman, Lewis M., and Merrill, Maud. Measuring intelligence. Boston: 
Houghton Mifflin, 1937. 

46. Terman, Lewis M., and others. The Stanford revision and extension of the 
Binet-Simon Scale for measuring intelligence. Baltimore: Warwick and York, 
1917. 

47. Thorndike, Robert L. “Two screening tests of verbal intelligence.” J. appl. 
Psychol., 1942, 26, 128-135. 

48. Thorndike, Robert L. (ed.). Research problems and techniques. AAF Aviation 
Psychol. Prog. Res. Rep. No. 3. Washington: Govt. Printing Off., 1947. 

49. Thurstone, L. L. The primary mental abilities. Chicago: Univ. of Chicago 
Press, 1938. 

50. Tiffin, J., and Lawshe, C. H., Jr. “The adaptability test.” J. appl. Psychol., 
1943, 27, 152-163. 

51. Traxler, Arthur E. “A note on the reliability of the revised Kuder Preference 
Record.” J. appl. Psychol., 1943, 27, 510-511. 

52. Zyve, D. L. “A test of scientific aptitude.” J. educ. Psychol., 1927, 18, 525-546. 



CHAPTER 5 


How to Give Tests 

WHO SHOULD GIVE TESTS? 

Tests should be given only by persons who will give them well. Some tests 
are sufficiently simple for any intelligent adult to give them successfully; 
others are so subtle that months of special training are required before the 
tester can do a fully effective job. In general, group tests require less training 
to administer than individual tests, although there are some exceptions. If 
the tester has no responsibility save to read a set of printed directions, any 
conscientious person should be successful. Where it is necessary to question 
the subject individually and to use follow-up questions when the first answer 
is unclear, great skill and experience are required. 

The tester must take pains and precautions if he wishes to produce results 
which give every subject a chance to exhibit his ability, and which are 
comparable to results obtained by other testers. He must prepare for the testing 
by becoming thoroughly familiar with the test. If he makes errors in 
administering it, the scores will not be comparable to the norms. Even fairly simple 
tests usually present one or two stumbling blocks which can be anticipated 
if the tester studies a manual in advance. 

The tester must maintain an impartial and scientific attitude. Testers are 
usually keenly interested in the subjects they work with, and desire to see 
them do well. As a result, beginning testers are tempted to give hints to the 
subject or to coax him toward greater effort. It is the duty of the tester to 
obtain from each subject the best record he can produce; but the subject must 
produce this by his own efforts, without unfair aid. The tester must learn to 
control both tendencies to give extra help directly and unconscious acts 
which may give the subject a clue. This is especially a problem in individual 
testing, where each question is given orally. On a mental-test item where the 
child is supposed to receive only one trial, he may give an answer which 
shows that he did not comprehend the question. The tester will often be 
tempted to repeat the question “since he could certainly have done it if he 
had understood what was wanted”; this must not be done, since the test di- 
rections permit only one trial. Adjustment may be warranted on some border- 
line decisions, however, such as the case in which the result is discarded, 
rather than scored as wrong, because an outside disturbance caused the 
child’s failure. Help can be given the subject unconsciously by facial 
expression or words of encouragement. The person taking a test is always 
concerned with the question, “How am I doing?” and watches the examiner for 
indications of his success. Suppose he is given the problem: “Repeat 
backward 2-7-5-1-4.” He may begin “4-1-7 . . .”; if the examiner, on hearing 
the “7,” permits his facial expression to change, the subject may take the 
hint and catch his own mistake. The examiner must maintain a completely 
unrevealing expression, while at the same time silently assuring the subject 
of his interest in what he says. 

Maintaining rapport with the subject is necessary if he is to do well. That 
is, the subject must feel that he wants to cooperate with the tester. A teacher 
who knows the child, or a counselor who has worked with an adult, can often 
secure more spontaneous and representative performance than a stranger 
called in to administer tests. Those who are acquainted with the subject, 
however, will be less impartial, and must be unusually cautious in following 
procedures. No rules can be given for the establishment of rapport, but test- 
ers with pleasing personalities will develop many techniques. The person 
who proceeds coldly and “scientifically” to administer the test, without 
convincing the subject that he regards him as an important human being, will 
frequently find it difficult to maintain cooperation. Poor rapport is evidenced 
by inattention during directions, “giving up” on a problem before time is 
called, restlessness, finding fault with the test, and similar symptoms. Addi- 
tional suggestions for improving motivation will be made below. The reader 
who has not observed individual testing of children will probably be 
interested in the verbatim transcript of a test listed among the readings at the end 
of this chapter. 

The characteristics usually required of a leader are helpful in testing. 
Poise, a clear voice, good diction, and a pleasant manner are assets to be 
cultivated by the tester. A tester who is shy may obtain poor results from a 
group test, until he overcomes his shyness by familiarity with the task. 

PROCEDURE FOR TEST ADMINISTRATION 
Conditions of Testing 

The author of a test provides instructions for testers which must be fol- 
lowed uniformly. But many aspects of the test situation the manual does not 
discuss. The first of these is the physical situation where the test is given. If 
ventilation and lighting are poor, subjects will be handicapped. On speed 
tests particularly, their scores will be lower than they deserve if they do not 
have a convenient place to write, including sufficient space to spread out 
whatever materials they need. Subjects must be placed so that they can 
hear directions and see the gestures of the tester clearly. Very large rooms 
are generally bad for group testing, unless proctors are stationed to watch 
closely the performance of subjects. The large room has the disadvantage 
that a person may hesitate to ask a question about unclear directions which 
he would raise before a smaller audience. This may be solved by having him 
raise his hand so that a proctor will come to his seat and answer his question. 

The state of the person tested affects the results obtained. If the test is 
given when he is fatigued, when his mind is on other problems, or when he 
is emotionally disturbed, results may not be normal for him. Occasionally it is 
necessary to test a person at an unfavorable time, as when psychological ex- 
aminations must be given a criminal at the time of his trial, or when tests 
must be administered during Army induction procedures. In such cases, 
great care is required if rapport is to be established, and the possible in- 
validity of results must be noted. A more common illustration of this prob- 
lem is the giving of tests to college freshmen. These tests, to be used in 
classification and guidance, are frequently given in the midst of a hectic 
week of orientation, college activities, establishment of new friends and liv- 
ing arrangements, and adjustment to a semi-adult world. Sometimes a 
freshman who later proves to be normally intelligent does very badly on 
placement tests because of homesickness, distraction, emotional exhaustion, or 
unidentified causes. Tests given under these conditions do have predictive 
value for most of the group, but many individual scores are untrue as a pic- 
ture of the student. If a test must be given at a psychologically inopportune 
time, the only correct procedure is to maintain an adequately critical atti- 
tude toward results. Many times, however, testing can be improved by alter- 
ation of conditions. Spacing of tests to avoid cumulative fatigue, providing 
for adequate rest on the night before tests, and administering the program 
with a minimum of bustle and confusion are helpful. 

The Army General Classification Test was often given just after induction 
when the men lacked sleep, were recovering from a farewell party, or felt ill 
from inoculations. In one study men who took a second form of the test after 
having familiarity with it and after becoming stabilized in Army routines 
raised their scores, some more than one standard deviation. The mean in- 
crease was 11.25 points (s.d. of scores = 20 points) (4). 
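
The size of this gain is easier to judge when it is expressed in standard-deviation units. The short sketch below only restates the arithmetic of the two figures quoted above (a mean increase of 11.25 points, a standard deviation of 20 points); the function name is hypothetical and chosen for illustration.

```python
def gain_in_sd_units(mean_gain: float, sd: float) -> float:
    """Express a mean score gain as a fraction of the score standard deviation."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return mean_gain / sd


# Figures quoted in the text for the retested men.
agct_gain = gain_in_sd_units(mean_gain=11.25, sd=20.0)
print(f"Mean gain = {agct_gain:.4f} standard deviations")
```

On these figures the average retested man gained a little more than half a standard deviation.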

Time of day may influence test scores, but it is rarely an important factor. 
When subjects are alert and eager, they are more likely to give their best 
results than when tired and dispirited. Evidence is adequate, however, that 
equally good results can be produced at any hour if the subjects want to do 
well. Fatigue apparently affects motivation rather than the ability one can 
summon up if he tries. The most thorough study was made by psychologists 
in the Army Air Force. Differences in scores earned at different hours were 
found to be negligible (10, p. 237). The cadets tested were extremely eager 
to make good showings, since their opportunity to become pilots hinged on 
the test results. 

The health of the subject usually affects the efficiency of his activities, 
although here, too, good motivation may overcome handicaps. Since some 
persons in any large group will be in poor health, no matter when the test is 
given, the tester should note individuals in poor health so that test scores for 
them will be recognized as possibly invalid. 

Control of the Group 

No general rules need be laid down for individual testing, since the more 
complicated individual tests are learned under supervision, so that faulty 
procedures are corrected. For successful group testing, a few devices for 
keeping the group working as a unit are generally applicable. Group testing 
is essentially a problem in command. The subjects must do as the leader tells 
them, they must do it instantly, and every person must do the same thing. 
This attitude must be maintained, without interfering with the opportunity 
of individuals to ask legitimate questions. One person should be in com- 
mand; he should be in front of the group, where he can see all members. He 
will find helpful the adage, “Never give an order unless you expect it to be 
obeyed.” False starts, preliminary attempts to call the group to order while 
late-comers are still finding seats, and ineffectual gavel-rapping make it more 
difficult to secure real obedience when work begins. Commands and 
directions should be given simply, clearly, and singly. The speaker should have 
full attention before he starts to talk, so that repetition will not be necessary. 
A complex instruction, “Take your booklet, turn it face down, and then 
write your name on the answer sheet,” will lead to misunderstanding and 
questions. It is better to break the instruction into unmistakable simple units: 
“Take your booklet.” (Hold a sample up, and watch the group to be sure 
everyone has taken his booklet before proceeding.) “Turn it face down.” 
(Demonstrate and wait until everyone has complied.) “Now take your 
answer sheet.” (Exhibit a sample, and wait for compliance.) “Write your name 
on the blank at the top, last name first.” By this technique, the subjects have a 
chance to ask questions whenever they are necessary. But the examiner 
should attempt to anticipate all reasonable questions by full directions. 

Military techniques are effective for control of a group. When a military 
manner is assumed, however, it may enhance the “inhuman” character of 
the test situation and give some people the feeling that the examiner is not 
interested in their welfare. Effective control may be combined with good 
rapport if the examiner is friendly, avoids an antagonistic, overbearing, or 
faultfinding attitude, and is informal when formal control is not called for. 
After establishing control, for example, he may often relax his “command 
manner” and make informal comments about the test and its purpose; this 
does not interfere with his resuming formal control for the test proper. 

The tester must take care to collect materials so that copies of questions 
will not be carried away. Persons not taking a test often wish to see copies of 
the questions, especially when some reward is attached to good test perform- 
ance. The probability of questions being smuggled out is small, but this can 
be prevented if each student is given one and only one set of materials when 
the test begins, and if these are collected from him personally before he 
leaves. 

Exceptional conditions arise which prevent uniform testing of all persons. 
Occasionally a person becomes ill during the test and must leave the room. 
Usually it will be possible to collect the materials, indicate that the test is in- 
validated, and provide for a make-up on another occasion, perhaps with a 
parallel test. The goal of the tester is to obtain useful information about peo- 
ple. There is no value in adhering rigidly to a testing schedule if that 
schedule will not give true information. Common sense is the only safe guide in 
the exceptional situation. 


1. The ACE Psychological Examination uses a test booklet and a separate answer 
sheet, with a special soft pencil. Formulate a set of statements to be made by 
the tester who, with the assistance of proctors, must collect these materials from 
a group of 200 adults. 

2. An employment office makes the practice of routinely testing all applicants on 
an intelligence test at the time their applications are filed. One man takes the 
test, together with several friends, and the group leave together. Ten minutes 
later he returns, greatly agitated: “Was I supposed to turn over the last page? 
I thought I had finished when I got to the bottom of page 9, so I looked back 
over my answers. I had plenty of time, and I’m sure I could have done well on 
the last page; my friends say the questions there were easy.” What should be 
done in this case, if at the bottom of page 9 the booklet carried the printed 
statement “Go on to the next page”? 

3. In testing a group of college freshmen to obtain information for use in guidance, 
the examiner finds that a Chinese student, newly arrived from China, is having 
great difficulty following directions because of unfamiliarity with English. The 
student asks many questions, requests repetitions, and seems unable to 
comprehend what is desired. What should the examiner do? 

4. In the course of a clinical analysis of a preschool child who is in some way 
poorly adjusted, a report on a series of tests is requested. The psychometrist who 
gives the tests finds that the child is persistently negativistic and after cooperat- 
ing reluctantly with two tests becomes inattentive and careless on the third. As- 
suming that the test results are needed as soon as possible, what should be done? 

Directions to the Subject 

The most important responsibility of the tester is giving the proper direc- 
tions to the subject. The purpose of standardized tests is to obtain measure- 
ments which may be compared with measurements given at other times; it is 
therefore imperative that the tester give the directions exactly as provided in 
the manual. If the tester understands the importance of this responsibility, it 
is simple to follow the printed directions, reading them word for word, 
adding nothing and changing nothing. It is usual to provide for the subject to ask 
questions after the directions have been read. In answering such questions, 
the tester must not add to the ideas expressed in the standard directions, since 
such supplementation might give his group an advantage over groups not 
having such aid. The directions are part of the test situation; in many tests, 
including certain intelligence tests and projective tests of personality, the 
way the subject follows directions is intended to influence the score. 

This advice is readily followed in most situations, but problems arise when 
subjects ask questions not provided for in the directions. Some are purely 
technical (“Do I need an eraser?” “Do we have as much time as we want for 
the test?”). Other questions that are particularly troublesome deal with mat- 
ters not mentioned in the standard directions, answers to which would help 
the student to obtain a better score. Examples are: “Should we guess if we 
are not certain?” “How much is taken off for a wrong answer?” “Are there any 
catch questions?” “If I find a hard question, should I skip it and go on, or 
should I answer every question as I go?” The published directions to the test 
were evidently not adequate. If the tester refuses to give an answer to the 
question on guessing (and he must refuse if the scores are to be compared 
with norms), some subjects will guess and some will not. The directions are 
therefore standard in a verbal sense only, since procedure becomes 
unstandardized when members of the group interpret indefinite instructions in their 
own way. Investigators of aviation performance have pointed out the crucial 
importance of making sure that the task is clearly defined for the subject. In 
observing pilots’ ability to execute a flying maneuver they found it advisable 
to tell the subject exactly how the performance would be scored. Otherwise, 
one pilot would keep his attention on maintaining correct altitude while an- 
other of equal ability would earn a different score because he concentrated 
on maintaining correct heading (7, p. 8). One can recommend that all tests 
should be provided with directions which leave no ambiguities for variable 
interpretation. The tester who must use a standard test for which directions 
are imperfect is faced with a dilemma for which no ideal solution exists. 

The projective test is an interesting exception to the traditional 
psychometric pattern which defines the task very objectively. In this test the subject 
is given a very vaguely defined task, and the object of testing is to observe 
how he attacks it. For instance, he may be given an inkblot and told to report 
what it looks like to him. If he asks how many ideas to report or whether to 
use the same portion of the blot in two ideas or any other question, he is told, 
“That’s up to you.” Of course, the procedure for this test is standardized in 
its very vagueness, which is kept the same for all subjects. 

5. The California Test of Mental Maturity consists of a series of 16 sections, each 
of which contains a different type of item. The sections are separately timed, 
each being about 3-5 minutes long. Is there any reason why a high school 
seeking data for guidance should not give pupils one or two sections of the test each 
day until all of it is taken, rather than giving it in just two sittings as the manual 
suggests? 

6. “Exact timing of the tests is extremely important. Deviation of even five sec- 
onds can increase a score ten points. A score on an incorrectly timed test is 
valueless.” This statement from the test manual refers to the two- and three- 
minute subtests of the SRA Reading Record. What does the test assume about 
the way in which eighth- to twelfth-grade pupils will use the working time 
allotted? 

7. Some workers have tried to make the inkblot test more uniform by setting a 
definite time, say, two minutes, for responding to each blot. What argument 
can be advanced against a worker’s adopting such a procedure (1, p. 379)? 

8. The ACE Psychological Examination (1939 edition) contained several sub- 
tests, each of which was preceded by practice exercises occupying a total of 
nineteen minutes. Says a reviewer, “Personal experience indicates that the 
time devoted to these might be reduced to about two minutes, thus saving 
about seven minutes, which would allow the test proper to be lengthened” 
(2, p. 200). Why may a tester not reduce the time for the practice exercise 
if he wishes? 


Guessing 


Because misunderstanding of the problem is common, we shall digress to 
consider the effect of guessing upon test scores. In every test, there are some 
answers of which the subject is not sure, although his doubt may range from 
slight lack of confidence in an answer to total ignorance. Some tests direct 
him to guess when in doubt; others direct him not to guess. The latter 
direction cannot be followed literally, since there are borderline items where he is 
neither totally sure nor totally guessing. Experimental studies show that if 
people know they will be penalized for wrong answers, some are hesitant to 
answer, but others answer freely even when in doubt (9). This difference in 
“tendency to gamble” is not eliminated by any change of directions or penal- 
ties. As the penalty becomes more severe, guessing diminishes, but the rash 
still take more chances than the timid. 

Guessing affects the score received. If there is no penalty for wrong 
answers, the guesser receives a better score than the non-guesser of equal 
ability. It is common for the test constructors to compensate for success due to 
guessing by a formula for chance correction. If two-choice items are used, 
there is one chance in two of success by pure guessing, and the score is 
computed by the formula “Right minus wrong.” For a multiple-choice item, the 
correction formula is “Right minus $\frac{\text{Wrong}}{n - 1}$,” where $n$ is the 
number of choices offered. Numerous technical flaws have been found in these formulas, but 
the most serious criticism is that they do not “correct for guessing.” Guessing 
is not a matter of pure chance; the guesser should, on the basis of background 
and common sense, be able to guess the right answer on an aptitude or 
achievement test more often than if he decided his answer by a chance 
method, such as rolling dice. If a person guesses on 10 five-choice items, he 
will probably get at least three or four items right, instead of the two items 
chance would predict. But another person, who does not guess, will receive a 
score of zero on the same 10 items, so that the score depends less on knowl- 
edge than on the tendency to gamble. Individual differences in this tendency 
can be eliminated by directing everyone to guess, but guessing introduces 
large chance errors. 
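
The arithmetic of the chance-correction formula can be made concrete with a short sketch. The Python below is an illustration only: the function name `corrected_score` and the particular counts of right and wrong answers are assumptions chosen to mirror the ten-item, five-choice example above, not data from any study.

```python
def corrected_score(right, wrong, n_choices):
    """Chance-corrected score: right minus wrong / (n_choices - 1).

    Omitted items count neither for nor against the subject.
    """
    if n_choices < 2:
        raise ValueError("an item must offer at least two choices")
    return right - wrong / (n_choices - 1)


# Hypothetical subjects facing the same 10 five-choice items, none of
# which is known for certain (illustrative counts, not observed data).
blind_guesser = corrected_score(right=2, wrong=8, n_choices=5)     # pure chance
informed_guesser = corrected_score(right=4, wrong=6, n_choices=5)  # common-sense guesses
non_guesser = corrected_score(right=0, wrong=0, n_choices=5)       # omits all ten

# For two-choice (true-false) items the formula reduces to "Right minus wrong".
true_false = corrected_score(right=7, wrong=3, n_choices=2)

print(blind_guesser, informed_guesser, non_guesser, true_false)
```

Under these assumed counts the formula cancels blind guessing exactly, yet the informed guesser still outscores the non-guesser of equal knowledge, which is the criticism raised above.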

Empirical studies have been made of the reliability and validity of tests
given with “guess” and “do not guess” directions. In general such evidence
has shown statistical superiority for “do not guess” instructions. For individuals
who are over-timid or who guess despite instructions, “do not guess”
directions lead to invalid scores, but for the group as a whole these directions
seem to give the most useful results. The problem of guessing is less impor- 
tant where the items are at a level of difficulty the person can cope with; 
chance errors in score are more numerous when many items require guessing. 

9. Give a difficult five-choice test, such as the Nelson-Denny vocabulary test, to 
a friend with instructions to answer items only when fairly certain of the cor- 
rect answer. When he has finished the test, provide a pencil of another color, 
and direct him to answer all the remaining items, making the best guess he
can. Determine his raw score on each trial with and without correction for
chance. Suppose the test were given with “do not guess” directions; how much
would he gain or lose by guessing despite the directions?

10. In taking a true-false test scored “Right minus wrong,” would a student be
wise or unwise to give an answer to every question even when he is uncertain? 

11. Some instructors advocate scoring tests by formulas which penalize guessing
very heavily, such as “Number right minus twice number wrong.” What effect
would this have on validity of measurement (?, 5)?

12. Discuss the wisdom of the practice described in the following paragraph: 

A vocabulary test constructed for Air Force classification contains 150 five-choice
items. The time limit is 15 minutes. The directions state: “This is not a
speed test. Your score does not depend so much on how many items you try
to answer as it does on how many you get right on each page you attempt.” 

The test is actually a speed test, and the score taken is right minus — 

13. In the SRA Reading Record, Test 2 (Comprehension) presents 16 four-choice 
items about a paragraph just read. One item is “Methods of harvesting were 
entirely revolutionized in (a) half a century (b) a decade (c) 200 years (d) 
a century.” In the eighth grade, norms are as follows: 14 right, 99th percen- 
tile; 13, 98th; 12, 92nd; 11, 84th; 10, 71st; 9, 67th; 8, 42nd; etc. Suppose that 
certain pupils, after answering eight items they can recall from the paragraph, 
use the remaining time to make a best guess on the other eight items. Estimate
roughly how much these pupils will raise their percentile scores, if they have
normal success in guessing.

MOTIVATION FOR TEST-TAKING

The efficiency of any performance depends upon the strength of the motivation;
test-taking is no exception. It is a familiar experience in the home to
find Johnny too tired to continue mowing the lawn, only to discover him 
fully recovered when a baseball game is proposed. This is not necessarily 
faking; we rarely use our highest level of output unless we have a great deal 
at stake. The purpose in ability testing is to determine the person’s highest 
level of performance; to do this, it is important that he give as near to his 
maximum as we can induce him to give. Measuring weight is a simple matter 
— we place the subject on the scale, and a rather good measure will result no 
matter how he feels about the measurement. Measuring mental character- 
istics is quite another matter. The subject must place himself against the 
scale, and unless he cares what results, no measurement can be had.

Fortunately there are a number of positive motives to lead the subject to do
as well as he can. At times, there is an expectation of direct reward from good
test performance; this is true in industrial testing and in military classification.
In such cases, performance less than one’s best is uncommon. Rewards
cannot ordinarily be provided, however, where they are not inherent in the 
purpose of testing. Definitive studies have not been made of the role of exter- 
nal rewards, such as promising a dollar to every person who does better on a 
test than he did the week before; one can guess that this would produce some 
changes, but such devices are impractical (8). Where no direct reward is 
offered, the subject usually is conditioned to good performance by his previous
experience in a competitive world. He has been taught to try to surpass
others; even where no results are to be published, the urge to “make
good” is powerful. Social facilitation may also be helpful in some cases; when
one is in the middle of a group, all devoting their efforts in the same way, he 
must be a rugged individualist indeed to take an attitude less serious than the 
others. The motivations of our competitive world are not accepted by all 
persons equally. Most middle-class children are imbued with the spirit of 
getting ahead through achievement. For them a mental test may be a chal- 
lenging, exciting experience. Many lower-class children, according to cul- 
tural studies, care little about school and educational values; for them the 
test may be too unmotivated to do them justice. 

Positive motivation itself may be harmful in tests where “typical” be- 
havior is sought. As will be seen in connection with personality testing, the 
urge to “do well” may cause one to distort his answers to some types of tests.

The motivation most helpful to valid testing is a desire on the part of the 
subject that the score be valid. This is not the normal competitive set, where 
one desires a high score whether it is true for him or not. It is a scientific set, 
a desire to find out the truth even if the truth is unpalatable. Such a set, 
where the subject actually becomes a partner in testing himself, is not usually 
present in industrial, educational, clinical, and research testing. More nearly, 
an autocratic approach is followed, something like “Take this test and I shall 
decide what is to be done with you.” Most testers would disclaim any inten- 
tion of dictatorship, yet it is true that tests have most often been used for 
the private information of the tester, who then bases recommendations on 
them. 

Cooperation between tester and subject is a not impossible goal. Where 
psychotherapy or counseling is based on diagnostic testing, where school ad- 
ministration is facilitated through results from standard tests, or where an 
employment manager must take responsibility for hiring the best-qualified
applicants, responsibility will ordinarily not be transferred to the person 
tested. The point of view, however, that makes the subject a member of the 
tester’s team (and vice versa) can be carried into practice in most situations 
by taking him into confidence as to the purpose of the testing and letting him 
feel that the test is an opportunity to find out about himself. The physician
often finds it helpful to tell the patient what medicine he is being given and
what good results are to be expected from it. If the person taking a test knows 
what it is measuring and why a fair measurement of that characteristic is to 
his advantage, he will have little motive to provide an untruthful picture. 
Perhaps the most “autocratic” of the current uses of testing is in industrial 
hiring, necessarily so, since the goal of testing is efficiency in industry. Yet
the tests given in the hiring line are to the advantage of the person tested, 
and it will build good will if he knows it. The very facts regarding turnover 
that motivate the employer to screen applicants are facts which would reas- 
sure the worker if he knew them. If he does well on the test, he can have 
confidence of making good on the job. If he does badly, he would probably 
be laid off later, whereas the tests will make it possible for him to begin ac- 
cumulating experience and seniority in another job for which he is fitted. 

Indoctrination regarding the nature and purpose of tests is desirable be- 
fore they are given. When tests are given without explanation, a sort of co- 
operation is obtained by virtue of the authority of the tester. Explanation 
makes clear what is to be done and why. An example of possible sound orientation
procedure is the film Cadet Classification made by the Army Air
Corps. Shown to soldiers before testing, the film would be an excellent intro- 
duction to the rather awesome array of tests given them. The film showed 
the tests, gave some insight into their purpose and the importance of follow- 
ing directions, and presented a dramatic scene in which Frank, who was 
classified by the tests as bombardier rather than pilot, was convinced that “a 
pilot who cracks up in training is one for the enemy; but a live bombardier 
over Tokyo is one for our side.” 

The College Entrance Examination Board, which obtains data to be used 
as a basis for admission to college, distributes to applicants a booklet of in- 
struction and practice questions well in advance of the test. These questions 
give candidates an understanding of what is expected. They also reduce the
element of surprise and misunderstanding of directions in the excitement of 
testing. 

With young children and clinical cases, reliance on the tester’s pleasing 
personality is more important, but with other subjects understanding the
purpose of a test gives every motive to cooperate. If it is practical, a promise 
to discuss privately the subject’s test results is of great motivating value. 

14. How could a “cooperative” point of view in testing be adopted: 

a. By a school principal who wishes to divide his eighth grade into sections 
on the basis of intelligence? 

b. By a veterans’ counselor who must approve the plan of a handicapped vet- 
eran to go to college and prepare for dentistry? 

c. By a consulting psychologist who is asked by a social agency to diagnose 
and report on a potential delinquent? 

15. What explanation would you give the subject in each of the following cases? 

a. College freshmen are to be tested to determine which ones may fail because
of reading deficiency.

b. At the end of a course in industrial relations for foremen, an examination on
judgment in grievance cases is to be given. 

16. Hebb and Williams devised a test to measure the intelligence of rats (6). The
test consisted of a set of mazes to be run, success being scored if a direct path 
to the foodbox was taken. What problems of motivation would need to be con- 
sidered in administering this test? 

17. In prewar Japan, a young man’s chance for success in life depended on his 
capturing one of the limited number of openings in the higher schools. These 
vacancies were filled on the basis of competitive examinations and other data. 
Magazines bearing such titles as Student days, Examiners’ circle, and Period of
diligent study were widely circulated. These dealt with topics of interest to
candidates including information about test materials, although the tests themselves
were of course guarded. Would such magazines increase or decrease the
validity of test scores? 

18. In planning a competitive mental test to be given all Japanese youth applying 
for higher schools, two policies appeared possible. One was to devise new 
types of test items each year, so that knowledge about previous examinations 
would be of no help. The other proposal favored using the same types of ques- 
tions every year (for example, number series) but changing the items used. 
Discuss the relative merits of the plans from the point of view of the test 
maker, the student, and the person interpreting the results. 

19. According to the motion picture “Personnel Selection,” British soldiers to be 
classified were tested in groups. On one test, the agility test, however, each 
man was tested separately while his group of perhaps 20 others watched. The 
task called for running back and forth along a cross-shaped pattern, transferring
rings from one post to another.

a. What effect on score would be expected from being tested in a group rather 
than without an audience? 

b. What effect would be expected as a result of announcing each man’s score 
at the end of his trial — to be applauded if good? 

c. What advantage or disadvantage would a man have who comes last in
the group? 

It may seem surprising that there are motives which lead to lower, rather 
than higher, scores, yet this problem is not rare. Sometimes the subject wants 
to do well, yet earns an unduly low score; at other times he actually desires to 
do poorly. A common source of poor performance is anxiety and tension due 
to a strong desire to do well. When one is tense, he is likely to overlook errors 
that he would readily recognize otherwise. In psychomotor tests, tension 
leads to poor coordination and more erratic behavior than the best the subject
is capable of. The subject who fears criticism of his answers may attempt
to avoid it by being overcritical of himself. In individual mental testing, pa- 
tients filled with anxiety frequently attack their own answers, finding fault 
with their statements or elaborating them to include all possible variations 
and qualifications. By this they may spoil an answer which would have re- 
ceived credit. Another familiar mechanism is blocking, where the subject 
delays his response perhaps through fear of criticism, perhaps through cau- 
tiousness and a desire to think through a problem before answering. Some 
deviation in the direction of cautiousness may be unimportant save as a mark
of personality, but cautiousness rooted in emotional tension is likely to depress
the score. It is this characteristic which makes it possible to use individual
mental tests to diagnose personality deviations (see Chapter 7). It is
impossible to remove personality factors from the testing situation, but the 
tester must do his best to minimize emotionality, while at the same time 
maintaining a high level of interest on the part of the subject. 

When the subject develops fear of a test, not only are the results likely to 
be unfavorable, but the subject may develop permanently harmful attitudes. 
Every test is a threat, to some degree, to the pride and status of the subject; 
it is this quality which makes it challenging. But if one fears the damage the 
test may do, the pressure becomes highly unpleasant and he seeks an escape.
One escape frequently encountered is refusal to respond. Some subjects will
say “I don’t know” rather than attempt an item where they fear failure. If one 
tries his best and is defeated, he must admit even to himself that he was incapable
of success. If, however, he gives up, making only a half-hearted effort,
he has protected himself, since he can rationalize: “I could have done
as well as the others if I had tried, but I didn’t care what mark I made.” Such 
negativism may be accompanied by attacks upon the test or the examiner; 
if one rejects a test as “silly,” “childish,” or “unfair,” he has excused himself 
for failure to perform well. 

A test is necessarily a frustrating experience, since it is designed to include
many items which the student will fail. Frustration usually disrupts behavior; 
the task of the tester is to make frustration mild. This can only be done where 
the threat implied in the test is minor. Few tests are life-or-death matters; but 
to the subject they may seem so. If a delinquent fears that his punishment 
will hang on the test results, if a child fears that a poor intelligence rating will 
disappoint his parents so that he may lose their affection, if a college girl 
fears that failure will force her to leave her campus friends and return to the 
farm, if a young man fears that a test will prove him insane, or if an unem- 
ployed man fears that he will be unable to get a job if he does poorly, he will 
be very conscious of the threat from the test. A striking example is the case 
of the young reserve officer, extremely eager to serve in time of war, who 
failed his physical examination twice because the importance of passing 
made him emotional — and the emotion always brought his blood pressure 
over the acceptable limit. A series of “reconditioning” treatments eventually 
made it possible for him to take the test calmly. (A more interesting solution
was reported in another case, that of a flyer failed repeatedly owing to tension.
The examining doctor, sure the young man was physically sound, finally
gave up, saying, “I’m sorry, son, but you’re out. Your blood pressure readings 
are just too high. But let’s see, since you’re washed out anyway, what your 
blood pressure really is.” Relieved by a definite decision, even an adverse one, 
the young man took the next test calmly, made a satisfactory record, and was 
approved by the sympathetic doctor.) High blood pressure is more directly
an emotional concomitant than is poor thinking, but the situations just described
have their mental counterparts. Insofar as the tester can convince
the subject that the tests will be used to help him, not to harm him, and that 
a failure will not debar him from his goals if he is later able to produce evi- 
dence of his ability, the validity of scores will be increased. Emphasis must 
be placed on the positive use of results. A job applicant fearful of failing an 
aptitude test can be given to understand how test scores may indicate what 
work he would succeed in. A patient fearing the verdict of a diagnostic test 
should understand that it will point the way to methods of curing him.

Cases may arise in which the subject frankly wishes to do poorly on a test. 
There are times when students try to limit their scores on mental tests because
of a classification plan in which, it is rumored, the better students will
be required to do more work. In military selection, as it becomes suspected
that passing certain tests qualifies one for unpopular assignments, there is a 
temptation to fail them deliberately. Air Corps cadets reported (truthfully 
or not) having tried to do poorly on tests related to the post of bombardier, so 
that they would have a better chance of becoming pilots. Another motivation 
was that of the boy who deliberately failed his school subjects, so that instead 
of being promoted he would be kept in the grade where his less intelligent 
friends were to remain. 

It is not possible for the tester fully to control motivation. The most helpful 
recommendation is that he should be aware of the motives in his group, 
counteract them as well as he can, and discount whatever invalidating effect 
they may have upon test score. 

SUGGESTED READINGS 

Biber, Barbara, and others. “Stenographic record of psychological examination,”
in Child life in school. New York: Dutton, 1942, pp. 631-639.

Bingham, W. V. “Administration of tests,” in Aptitudes and aptitude testing. New
York: Harper, 1937, pp. 224-237.

Herben, Mary S. “An analysis of rapport between adult and child, as shown in the 
psychological test situation,” in Thomas, Dorothy S., and others. Some new 
techniques for studying social behavior. New York: Teachers Coll., Columbia 
Univ., 1929, pp. 161-198. 

Lawson, Douglas. “Need for safeguarding the field of intelligence testing.” J. educ.
Psychol., 1944, 35, 240-247.

Terman, Lewis M., and Merrill, Maud A. “General instructions,” in Measuring intelligence.
Boston: Houghton Mifflin, 1937, pp. 52-71.

REFERENCES 

1. Biber, Barbara, and others. Child life in school. New York: Dutton, 1942. 

2. Buros, Oscar K. (ed.). The 1940 mental measurements yearbook. Highland
Park, N.J.: The Mental Measurements Yearbook, 1941.

3. Cronbach, Lee J. “The true-false test: a reply to Count Etoxinod.” Education,
1941, 62, 59-61.

4. Duncan, Acheson J. “Some comments on the Army General Classification
Test.” J. appl. Psychol., 1947, 31, 143-149.





5. Etoxinod, Count Sussicran. “How to checkmate certain vicious consequences 
of true-false tests.” Education, 1940, 61, 223-227. 

6. Hebb, D. O., and Williams, Kenneth. “A method of rating animal intelligence.” 
J. gen. Psychol., 1946, 34, 59-65. 

7. “Psychological research on pilot training in the AAF.” Staff, psychological research
project (pilot). Amer. Psychologist, 1946, 1, 7-16.

8. Seashore, Harold G. “The improvement of performance on the Minnesota Rate
of Manipulation Test when bonuses are given.” J. appl. Psychol., 1947, 31,
254-259.

9. Swineford, Frances. “Analysis of a personality trait.” J. educ. Psychol., 1941,
32, 438-444.

10. Thorndike, Robert L. (ed.). Research problems and techniques. AAF Aviation
Psychol. Prog. Res. Rep. No. 3. Washington: Govt. Printing Off., 1947.







CHAPTER 6 


The Binet Scale and Its Descendants 

EMERGENCE OF INTELLIGENCE TESTING

Tests Before Binet

The outstanding success of scientific measurement of individual differences
in behavior has been the intelligence-test movement. Despite the overen- 
thusiasm and occasional errors that have attended their development, mental 
tests stand today as the most important single tool psychology has developed 
for the practical guidance of human affairs. Among mental tests, none has 
been more influential than that fathered by Alfred Binet, in its many forms.
A history of mental testing is in large part a history of the Binet test, its ante- 
cedents and its descendants. 

The first systematic experimentation on individual differences in behavior
arose from the accidental discovery of differences in reaction time among
astronomers. In 1796, an observatory assistant at Greenwich named Kinnebrook
was engaged in recording, with great precision, the instant when certain
stars crossed the field of the telescope. When Kinnebrook’s results
were found to be consistently eight-tenths of a second later than the observations
of his superior, the Astronomer Royal, he was thought to have been incompetent
in his work and was discharged. It was not until twenty years
later that more careful study showed that the differences between observers
were the result of the different speeds with which they could respond to stimuli.
Only gradually did such differences come to be recognized as significant
facts about human nature, rather than annoying errors contaminating scientific
work.

Physiologists, psychologists, and anthropologists were stimulated by the 
scientific climate of the nineteenth century to make a great variety of meas- 
urements of human characteristics. Notable among these early workers was 
Galton, whose interest in differences among individuals was stimulated by
Darwin’s newly published theory of differences between species. Galton’s
studies during the latter half of the nineteenth century included numerous
explorations of ways of measuring physical characteristics, sensory qualities
such as keenness of hearing, mental imagery, and so on. These methods were
not developed fully by Galton, but served as models for later tests. In addition,
Galton’s evidence that outstanding intellectual achievement was often
found in successive generations of certain families emphasized that genius
was not an accident or a gift of the gods, but rather an important phenomenon
to be investigated scientifically.

At this time, psychologists were only beginning to apply objective techniques,
and naturally turned their attention first to those things which could
readily be studied in the laboratory. Wundt, with his laboratory in Leipzig,
won a notable triumph in developing objective psychological laws comparable
to those of physics. Necessarily, however, the search for such simple
laws led him and his students to concentrate on simple phenomena. Moreover,
since it was thought that the function of science was to analyze behavior
into its simplest elements, a deliberate search was made for procedures
having high logical validity, measuring very limited functions.
While Wundt was not concerned with individual differences, his laboratory
methods had a strong influence on early tests. As early as 1890, Cattell
brought to the United States from Germany a collection of tests of such
functions as sensory acuity, strength of grip, and memory. This development
unfortunately met an early debacle when it was discovered that, while
measuring “pure” characteristics, the new tests seemed to have no relation to
significant human behavior. The crucial study was Wissler’s work on test
scores of Columbia students, reported in 1901. He correlated college marks
with many of the Cattell tests, finding such negligible correlations as the following:
reaction time, −.02; canceling a’s from a printed page rapidly,
−.09; naming colors, .08; auditory memory (recall of digits), .16 (48). Apart
from the insignificant functions tested, low correlations were certain to 
result because of two principles mentioned in Chapter 4: brief tests are 
quite unreliable, and validity coefficients are always reduced when a group 
is highly selected. The disappointment which followed the Wissler study, 
however, delayed attempts to use measurement in applied psychology. 
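
How strongly these two principles can shrink a correlation may be sketched numerically. The Python below applies two standard results that are not quoted from this text: Spearman's attenuation formula, and the usual range-restriction relation (which assumes a linear relation with equal scatter at every score level). The function names, the assumed error-free correlation of .50, the reliabilities, and the standard-deviation ratio are all hypothetical values chosen for illustration; they are not Wissler's data.

```python
import math


def attenuated_r(true_r, rel_x, rel_y):
    """Correlation expected between two fallible measures whose error-free
    correlation is true_r and whose reliabilities are rel_x and rel_y."""
    return true_r * math.sqrt(rel_x * rel_y)


def restricted_r(r, sd_ratio):
    """Correlation expected in a selected group.

    sd_ratio is the test's standard deviation in the selected group divided
    by its standard deviation in the unselected group (assumes a linear
    relation with equal scatter at all score levels).
    """
    return r * sd_ratio / math.sqrt(1 - r**2 + (r**2) * (sd_ratio**2))


# Hypothetical values, chosen only for illustration.
r_error_free = 0.50                                    # assumed relation without measurement error
r_brief_test = attenuated_r(r_error_free, 0.40, 0.80)  # brief test with reliability .40
r_selected = restricted_r(r_brief_test, 0.50)          # group with half the usual spread

print(round(r_brief_test, 2), round(r_selected, 2))
```

With these assumed figures, a relation of .50 falls to roughly .28 through unreliability alone and to roughly .15 once the group is also highly selected, which suggests how even a meaningful test could yield coefficients as small as those Wissler reported.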

A third line of attack stemmed from abnormal psychology. The tests of 
Galton were a miscellaneous lot suitable to studying human variation in gen- 
eral; the psychophysical tests were selected because they were precise and 
suitable to the laboratory. In contrast, the tests of the early clinical workers 
were intended to serve a practical end. To study mental defectives and 
pathological cases, clinicians required tests of complex mental functioning. 
Numerous workers, notably Kraepelin in Germany, developed tests which
were of value in distinguishing normal from abnormal subjects. Few of these
tests, however, were perfected to the point of clinical usefulness. 

The Binet Tests 

Alfred Binet, a French psychologist, had become interested in studying 
judgment, attention, and reasoning. His interest in these more complex men- 
tal processes led him to try a greater variety of tests than his predecessors 
used. Binet was concerned with obtaining insight into real behaviors and did 
not restrict himself to “pure” measures. In studies published between 1893 
and 1911, he explored in many ways the differences between bright and dull
children. With little preconception regarding this difference, he tried tests including
recall of digits, suggestibility, size of cranium, moral judgment, tactile
discrimination, mental addition, graphology, and even palmistry! As he
found, with the earlier investigators, that the tests of sensory judgment and
other simple functions seemed to have little relation to general mental functioning,
he gradually formulated a description of intelligence as “the tendency
to take and maintain a definite direction; the capacity to make adaptations
for the purpose of attaining a desired end; and the power of auto-criticism”
(41, p. 45).

The stage was set, then, for the call in 1904 to produce the first practical 
mental test. The schools of Paris became concerned about their many non- 
learners and decided to remove the hopelessly feeble-minded to schools 
where they would not be held to the standard curriculum. Aware of the
errors in teacher judgment, they wished to avoid segregating the child of 
good potentiality who could learn if he tried and the troublemaking child 
whom the teacher wanted to be rid of. Moreover, they wished to identify all 
the dull from good families whom teachers might hesitate to rate low, and 
the dull witli pleasant personalities who would be favored by the teacher. 
Therefore they asked Binet to assist in producing a method for separating the 
genuinely dull from those who had adequate educability. Binet’s scale,
which drew on his earlier studies, was published in collaboration with Simon
in 1905. In 1908, a revision was published. The tests were arranged in order
of difficulty, so that a child passing those tests normally passed by ten-year- 
olds could be considered their mental equal in average performance. Still 
another revision was produced in 1911. The Binet scale, as it stood then,
differs only in detail from the individual tests for children in widest use
today.

At the time Binet’s tests were made available, there was a great demand for 
objective methods of investigating psychological problems. American psy- 
chological research had been dominated for years by introspection and ques- 
tionnaire techniques, both of which were as fallible as the judgment of the 
subject reporting. Binet’s method, which was to a large degree impartial and
independent of the preconceptions of the tester, was welcomed enthusiastically
as a research technique and as a tool for studying subnormal children.
In 1910, Lewis M. Terman began experimentation with the Binet tests
and produced, in 1916, the Stanford Revision of the Binet Scale. He extended
its application to normal and superior children. The Stanford-Binet
had immediate popularity and became, rightly or wrongly, the yardstick
by which other tests were judged. Although there had been various
prior mental tests, the outstanding popularity of the Stanford test made its
conception of intelligence the standard. The acceptance of the Stanford test
was due to the care with which it had been prepared, its success in testing
complex mental activities, the easily understood “IQ” it provided, and the important
practical results which it quickly produced. Although many criticisms
have been made of the test, it was and is an exceptionally useful tool.

Changes in individual intelligence testing since 1916 have been minor. A
few new test forms were adopted, and several revisions of Binet’s test were
made by psychologists other than Terman, but all in all, present-day testing
is traceable to the 1916 Stanford-Binet in form and conception. The student
should not assume that other tests before and after Binet had no merit, 
merely because they failed to attain comparable prominence. It is probable 
that the early workers explored many leads which were worth exploitation
and which have been unduly neglected. Binet himself (following the lead of
earlier workers) made use of inkblots to study imaginative and perceptual
processes, but this technique fell into obscurity from which it emerged only 
because Rorschach, twenty years later, independently revived the procedure. 
In his monograph, The experimental study of intelligence, Binet described
the application of inkblot and imagery tests to his daughters, coming out 
with clinical, qualitative descriptions of the way their intelligence functioned 
which read as if taken from the most modern results of projective techniques 
(35, pp. 146-150). This lead, which psychologists are today pursuing, was 
neglected while the less informative plan of reducing all intelligence to a 
single score was adopted. The accidents of time and place play a large part 
in psychological history; there was, in 1905, a great practical need for a
simple and objective way of comparing children, but no popular demand for 
diagnostic clinical tools. 

The 1916 Stanford-Binet was replaced in 1937 when Terman and Merrill
published Forms L and M of the Stanford-Binet. These tests improved on 
the construction of the former edition and offered the marked advantage of 
two comparable forms, so that the child could not increase his score on a re- 
test by remembering particular questions. 

Other Developments 

Other individual tests have been designed for special needs. Some tests 
have stressed “non-verbal” or “performance” items, which give a fairer esti- 
mate of intellect for children with language handicaps. Tests have been de- 
signed for special age groups, from preschool children to adults. One of the 
most significant of these is the Wechsler-Bellevue Intelligence Scale. This 
test, first designed for adults and now being extended to earlier ages, is dis- 
cussed at length in the next chapter. 

During the time of the emergence of individual tests, group tests also became
widely popular. Several psychologists had devised experimental group
tests prior to 1916, but it remained for World War I to give group testing its
main impetus. When it became necessary to expand the Army at an explosive
rate, the Army requested psychologists to provide a group test for inductees
so that those who were promising could be given officer training,
those who were unfit could be rejected, and others could be appropriately
classified. In one of the major achievements of practical psychology, a group
including Terman, Yerkes, and Bingham assembled (on the basis of test
materials provided by A. S. Otis) a test which, after experimental revision,
became famous as Army Alpha. Alpha was a test of ability to follow directions,
simple reasoning, arithmetic, and information. It was a practical test,
easily administered and highly useful to the Army, as Figure 18 suggests.
It convinced people that adequate prediction of human success could be
made by mass processing, so that following the war schools and industry
demanded tests of this type. Alpha in a civilian revision and comparable group
tests by Otis and others were widely applied. Tests being published today,
and those used in World War II classification, differ only slightly from those
of the 1920 vintage. The principal difference is that care in construction has
made the newer tests more reliable and provided better norms.

Fig. 18. Alpha scores of Army personnel of various ranks in
World War I (49).

One major development has been the emergence of diagnostic mental
tests, which seek to measure separately the various aspects of mentality instead
of arriving at a single intelligence rating. The most complete development
of this approach is found in Thurstone’s Tests of Primary Mental Abilities.
Thurstone has separated mental functions into such aspects as verbal,
number, space, and reasoning, and has prepared separate tests of these abilities.
While the many separate tests require more time than the usual test of
general ability, they provide more information and will probably have practical
importance. To date, they have been used principally for research.

The foregoing history is far too sketchy to do justice to the numerous contributors
to present-day testing, or to indicate the serious and significant controversies
regarding intelligence which were gradually resolved. The reader
who wishes to trace the emergence of current concepts will find Peterson’s
book (35) especially helpful.

CHARACTERISTICS OF THE STANFORD-BINET SCALE 
Assumptions About Intelligence 

In the Binet scale, as in every test to be studied, one can trace how the investigator
solved four problems which face every test designer. First, he must
decide just what he intends to measure. Second, he must invent or select
items which adequately serve that purpose. Third, he must find a measuring
unit in which to express results, since behavior rarely can be divided into
such countable units as inches, pounds, or light-years. Fourth, he must show
the validity of the test, usually by relating it to an external criterion. Knowing
how these were answered for the Binet test not only reveals wherein it made
a contribution but also throws light on its limitations.

The person making the first mental test is in the position of the hunter
going into the woods to find an animal no one has ever seen. Everyone is sure 
the beast exists, for he has been raiding the poultry coops, but no one can 
describe him. Since the forest contains many animals, the hunter is going to 
find a variety of tracks. The only way he can decide which one to follow is by 
using some preconception, however vague, about the nature of his quarry: If 
he seeks a large flat-footed creature he is more likely to bring back that 
carcass. If he goes in convinced that the damage was done by a pack of small 
rodents, his bag will probably consist of whatever unlucky rodents show their 
heads. 

Binet was in just this position. He knew there must be something like in- 
telligence, since its everyday effects could be seen, but he could not describe 
what he wished to measure, as it had never been isolated. Some workers, then 
and now, have objected to this circular and tentative approach whereby intelligence
can only be defined after the test has been made. Tests are much
easier to interpret if the items conform perfectly to a definition laid down in 
advance. When faculty psychology was in vogue, many separate tests were 
designed for the separate mental faculties: reasoning, memory, attention, 
sensory discrimination, and so on. None of these tests used singly, however, 
was found to be useful. Terman explains this as follows: 

“The assumption that it is easier to measure a part, or one aspect, of intelligence
than all of it, is fallacious in that the parts are not separate parts and can not be
separated by any refinement of experiment. They are interwoven and intertwined.
Each ramifies everywhere and appears in all other functions. . . . Memory, for example,
cannot be tested separately from attention, or sense discrimination separately
from the associative processes. After vainly trying to disentangle the various
intellective functions Binet decided to test their combined functional capacity without
any pretense of measuring the exact contribution of each to the total product”
(45, p. 151).

We will see in later chapters how diagnostic tests have, in spite of this, obtained
useful information about distinct aspects of ability. In Binet’s time,
though, one of his great contributions was to abandon the idea of separate
faculties in favor of the concept of general intelligence. Having started with the
idea that some children were bright and some dull, he found quickly that
those who were best on tests of judgment were also superior on tests of attention,
memory, vocabulary, and other mental activities. In other words, the
tests were correlated. This correlation shows that there must be some underlying
unity among mental tests. This is what psychological testers mean by
“general intelligence”; it is the factor which accounts for the correlation
among mental tests.
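
The reasoning from correlation to a common factor can be illustrated with a small simulation. The sketch below is not drawn from Binet's data. It assumes, purely for illustration, that each test score is the sum of one component shared by all tests and one component specific to the test; the test names and the equal weighting of the two components are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_scores(n_children, test_names, seed=0):
    """Simulate scores as shared component + test-specific component.

    The equal weighting of the two components is an assumption made
    for illustration only.
    """
    rng = random.Random(seed)
    scores = {name: [] for name in test_names}
    for _ in range(n_children):
        general = rng.gauss(0, 1)  # component shared by every test
        for name in test_names:
            scores[name].append(general + rng.gauss(0, 1))
    return scores


tests = ["judgment", "attention", "memory", "vocabulary"]
scores = simulate_scores(500, tests)

# Correlate every pair of tests.
correlations = {
    (a, b): pearson(scores[a], scores[b])
    for i, a in enumerate(tests)
    for b in tests[i + 1:]
}
for pair, r in correlations.items():
    print(pair, round(r, 2))
```

Under these assumptions every pair of tests correlates positively, although no two tests share anything except the common component. The converse inference, from observed correlations back to a single factor, is the step the text describes.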

Binet refined his idea of intelligence by trial and error. If color matching 
does not correlate with other estimates of mental ability, it must not be in- 
fluenced by the common factor. If knowing certain information correlates 
with the tests of reasoning, both must be in part measures of intelligence. Out 
of a study of his best test items, Binet came to his famous description, quoted 
above. Binet also inferred that intelligence increases as one approaches ma- 
turity. 

1. What do the following definitions of intelligence include that Binet’s definition
does not, and vice versa?

a. “The ability to do abstract thinking” (Terman).

b. “The power of good responses from the point of view of truth or fact”
(E. L. Thorndike).

c. “The property of so recombining our behavior patterns as to act better in 
novel situations” (Wells). 

2. Would the same sort of test items be called for by these definitions?

3. Is previous learning included in intelligence by these definitions, and by Binet’s? 

4. Binet first sought to study brightness and dullness by comparing children of the
same age whom their teachers considered superior or inferior. Might the chil- 
dren selected for him differ in any factors other than general intelligence? 

Selection of Items 

If we were testing high-jumping ability, we would ask a boy to jump over 
standards of varying heights, beginning with easy ones and increasing the 
height until we found the highest level at which he could succeed. The ex- 
perimental psychologist uses the same device in measuring weight discrim- 
ination. The test begins with pairs of weights which are easily discriminated, 
and as the subject succeeds the difference in weights is reduced until he can 
no longer tell which is heavier. The Binet test adopts a similar “hurdle” pattern.
It begins with items the subject is expected to pass. As the items become
more difficult, the subject begins to fail, and the test is continued until
we have discovered the most difficult mental hurdle he can get over. 

Since mental ability is thought of as something that increases with age, a 
good test item is one that is passed by older children, but not by younger 
ones. We would reject a proposed exercise that 25 percent of children of
every age can pass; although this item is difficult, it does not relate to mental
development. Items for the Binet test and its revisions were tested by determining
how many children at each age can pass them. How items are
tested is illustrated in Figure 19, which is based on the standardization data
for the 1937 test. Item A is a very good item, by the criterion of increase with
age. It is rarely passed by those under 10, and is passed by almost all 16-year-olds.
In contrast, item B is one of the least satisfactory items retained in the
test; it shows very slight increases with age above 11. Items which showed
even less relation to age were eliminated in making up the revision.
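
The screening rule just described can be sketched in a few lines. The percentages below are invented for illustration and are not the standardization data of Figure 19; the function names and the 50-point cutoff are likewise assumptions of this sketch.

```python
def age_gain(percent_passing):
    """Gain in percent passing from the youngest to the oldest age tested.

    percent_passing maps age in years to the percentage of children of
    that age who pass the item.
    """
    ages = sorted(percent_passing)
    return percent_passing[ages[-1]] - percent_passing[ages[0]]


def relates_to_age(percent_passing, min_gain=50):
    """True if the item is passed much more often by older children.

    min_gain is a hypothetical cutoff chosen for this sketch.
    """
    return age_gain(percent_passing) >= min_gain


# Hypothetical items (not real Stanford-Binet data).
steep_item = {9: 5, 11: 30, 13: 60, 16: 95}   # rises sharply with age
flat_item = {9: 25, 11: 25, 13: 25, 16: 25}   # hard, but unrelated to age

print(relates_to_age(steep_item))  # True
print(relates_to_age(flat_item))   # False
```

The flat item is the case mentioned in the text: it is difficult at every age, yet it tells nothing about mental development, so it would be rejected.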

Binet’s assumption, that a good item measures the same thing as other
items, was also applied. Each item was correlated with the total scale (cf.
Table 10). The correlations varied greatly; item A, just discussed, correlated
.91 with total score of 13-year-olds, but the correlation for B was only .27.
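
A minimal sketch of an item-total correlation follows. The pass/fail records and total scores are made up for illustration. The computation is the ordinary Pearson correlation applied to a pass (1) or fail (0) item score, which is one common way to express such a relation and is not necessarily the statistic used in the standardization study (27).

```python
import math


def item_total_correlation(item_passed, total_scores):
    """Pearson correlation between a 0/1 item score and the total score."""
    n = len(item_passed)
    mean_i = sum(item_passed) / n
    mean_t = sum(total_scores) / n
    cov = sum((i - mean_i) * (t - mean_t)
              for i, t in zip(item_passed, total_scores))
    var_i = sum((i - mean_i) ** 2 for i in item_passed)
    var_t = sum((t - mean_t) ** 2 for t in total_scores)
    return cov / math.sqrt(var_i * var_t)


# Hypothetical data for eight children of the same age.
# Totals are in arbitrary score units, ordered from lowest to highest.
totals = [130, 140, 150, 155, 160, 170, 175, 185]
item_a = [0, 0, 0, 0, 1, 1, 1, 1]  # passed only by the higher scorers
item_b = [1, 0, 1, 0, 0, 1, 0, 1]  # passing is nearly unrelated to total

r_a = item_total_correlation(item_a, totals)
r_b = item_total_correlation(item_b, totals)
print(round(r_a, 2), round(r_b, 2))
```

An item like the hypothetical item_a separates high from low scorers and so measures what the rest of the scale measures; an item like item_b adds little beyond chance.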

Items are placed in the scale according to their difficulty for children at
each age. A test which about 60 percent of 13-year-old children can pass is
placed at Year XIII. Both item A and item B are placed at Year XIII. The
percentage level used in placing a test varies from age to age. Tests passed
by about 77 percent of two-year-olds are placed at Year II; tests passed by
about 63 percent of eight-year-olds go into the Year VIII hurdle. Because adjustments
are made to make the mean IQ 100 at all ages, all items placed at
the same level are not exactly equal in difficulty.

Fig. 19. Percentage of children of each age passing two
Stanford-Binet items. (Data from 27, p. 96.) A, Item from Form
M, Year XIII: defining connection, compare, etc. B, Item from
Form M, Year XIII: memory for details of story.

5. Some nine-year-olds pass item A (Fig. 19). What tentative conclusion can be 
made about their mental age? Why is this conclusion merely tentative? 

6. A Japanese investigator wishes to prepare a counterpart of the 1937 Stanford-
Binet for Japanese children. In what ways would a mere direct translation of
the test be unsatisfactory? 

Description of Test 

The child is given the Binet test by an experienced examiner, who gives 
each test in the precise manner called for by the directions (44). The examiner
begins by establishing rapport, aided by the high interest value the
“games” have for the younger child and the challenge of the test situation
for the older child. The first tests tried are those for a mental level below that
expected of the child; beginning with easier tests builds confidence. The
basal age, the scale level at which the child can pass all the tests, is determined.
Tests at each higher level are given, usually six at each level. The
test is continued until the child fails all tests at some level.

In Forms L and M, tests have been prepared for levels of mental development
from Year II to Superior Adult III. At the earliest years, ages two to
five, there are six tests at each half-year of development. Above age five,
hurdles are spaced one year apart; and above age 14, the levels have even
wider spacing. No child takes the entire set of 129 tests. A nine-year-old
would begin with tests for the eight-year level and, if he passed those, would
continue until he could no longer succeed. Some nine-year-olds would be un- 
able to go beyond the 11-year level, whereas others would still be passing a 
few tests at the 14-year level. 
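
The basal-age procedure and the spacing of levels described above can be sketched as follows. This is a simplified illustration, not the manual's scoring procedure: it assumes only levels spaced one year apart with six tests each, so that each test passed above the basal level is credited with two months (twelve months divided by six tests). The function name and the record format are hypothetical.

```python
def mental_age_months(passed_by_level, tests_per_level=6):
    """Estimate mental age in months from tests passed at yearly levels.

    passed_by_level maps a year level (e.g., 8 for Year VIII) to the
    number of tests passed at that level. Levels are assumed to be
    consecutive and one year apart.
    """
    months_per_test = 12 // tests_per_level  # 2 months when six tests per year
    levels = sorted(passed_by_level)

    # Basal age: the highest level passed in full, counting upward from
    # the lowest level tested until the first level with any failure.
    basal = None
    for level in levels:
        if passed_by_level[level] == tests_per_level:
            basal = level
        else:
            break
    if basal is None:
        raise ValueError("no basal level: begin testing at a lower level")

    # Credit for each test passed above the basal level.
    credit = sum(
        passed_by_level[level] * months_per_test
        for level in levels
        if level > basal
    )
    return basal * 12 + credit


# Hypothetical record for one nine-year-old: all tests passed at Year VIII,
# fewer at each higher level, none at Year XII (where testing stops).
record = {8: 6, 9: 5, 10: 3, 11: 1, 12: 0}
months = mental_age_months(record)
print(divmod(months, 12))  # (9, 6), i.e., 9 years 6 months
```

The half-year levels below age six and the wider levels above age 14 carry different credits and are left out of this sketch.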

Administering the test requires a good deal of skill. The tester must exercise
considerable judgment in obtaining from each child as clear answers as
possible, without probing more than the standardized directions allow. Rapport
is especially a problem with younger children, who are not habituated
to tests or to tasks calling for sustained attention. One hour, more or less, is
required for each test, although there is great variation from child to child.

Only those persons should give the Stanford-Binet who have been trained 
in its use and scoring. Terman suggests that an adequate training program 
calls for a general course in mental test theory, a second course in methods 
during which the student tests at least 25 subjects, and further experience 
in clinical courses where the student gives the test to various kinds of sub- 
jects. Training may be considered complete when the person has tested about 
100 cases, beyond the 25 practice subjects. 

The test includes a great variety of tasks. The range of mental abilities
tapped is illustrated in Table 11. The following test, “Verbal absurdities,”
at Year IX, Form L, is representative (44, p. 108):*

“Procedure. Read each statement and, after each one, ask ‘What is foolish about
that?’ If the response is ambiguous, say, ‘Why is it (that) foolish?’

“(a) Bill Jones’ feet are so big that he has to pull his trousers on over his head.

“(b) A man called one day at the post office and asked if there was a letter
waiting for him. ‘What is your name?’ asked the postmaster. ‘Why,’ said the
man, ‘you will find my name on the envelope.’

“(c) The fireman hurried to the burning house, got his fire hose ready, and
after smoking a cigar, put out the fire.

“(d) In an old graveyard in Spain they have discovered a small skull which
they believe to be that of Christopher Columbus when he was about ten years
old.

“(e) One day we saw several icebergs that had been entirely melted by the
warmth of the Gulf Stream.”

The child passes this for two months’ credit on his mental age score at Year
IX if three items are solved satisfactorily. Four correct allow two more
months’ credit at Year XII.

* Reproduced by permission of Houghton Mifflin Company.

Table 11. Representative Tasks from Stanford-Binet Test, Form L, Arbitrarily
Grouped According to Behavior Called For

Year    Information and Past Learning; Verbal Ability

II-6    *Points to toy object “we drink out of.” .69 [a]
        *Shows doll’s hair. .43
        *Names chair, key, . . . .41
        *Names objects from pictures. .78
IV      *Shows “what we cook on” in picture. .56
        *Names objects from picture. .74
VI      *Gives examiner 9 blocks. .56
        *Vocabulary: defines orange, envelope. .65
VIII    *“What makes a sailboat move?” .66
        *Vocabulary: eyelash, roar. .75
        *Finds absurdity in simple story. .70
X       *Vocabulary: muzzle, haste. .84
        *Names 28 words in one minute. .46
XII     *Vocabulary: skill, juggler. .79
        *Defines constant, defend. .74
        *Finds absurdity in simple story. .64
        *Completes “The streams are dry . . . there has been little rain.” .61
XIV     Vocabulary: brunette, peculiarity. .89
        *Defines constant, defend. .83

* These tests are used if the scale must be abbreviated.

[a] Numbers are correlations of the test with the total test score, for pupils at the
age level where the test is located (27, Table 53). The grouping of tests is based in
part on McNemar’s factor analysis (27, pp. 124-137).

Table 11. (Continued)

Memory; Perception; Reasoning

*Repeats “4-7.” .62
Formboard. .46
Draws missing leg on man. .47
*Matches circles, squares. .72
“Why do we have houses?” .78
*Reproduces bead chain after seeing sample. .58
Finds missing part of picture of wagon. .46
*Finds unlike form in row of pictures. .73
Maze. .60
Recalls story read by examiner. .70
*Tells how baseball and orange are alike and different. .75
*Reads paragraph aloud, recalls content. .49
Repeats “4-7-3-8-5-9.” .45
*Finds absurdity in picture. (See Fig. 20.) .38
*Gives reason why children should not be noisy in school. .53
*Repeats backward, “8-1-3-7-9.” .50
Describes picture; must make connected story. .55
Finds absurdity in picture. .60
*Generalizes from paper-folding experiment. .75
*Explains how to measure 3 pts. of water with a 7-pt. can and a 4-pt. can. .57
*“Which direction would you have to face so that your left hand would be
toward the east?” .57

Tests call for both verbal and non-verbal performances, simple memory
and complex reasoning, familiar situations to which answers have been
learned and novel problems where ability to adapt is called for. Those involving
apparatus and pictures are used at younger ages, with an increasing
reliance on verbal problems through the school ages, and more tests of abstract
thinking at the upper end of the scale. In Table 11, representative tests
are classified according to an arbitrary scheme. There is no evidence that
scores on the items grouped together are determined by the same special
mental ability, although some children show consistent weakness on verbal,
reasoning, or some other type of problem. Most of the tests involve several
types of performance; the maze, for example, calls for perception, reasoning,
and comprehending verbal directions.

Responses are scored as objectively as possible, with the aid of the scoring 
guide. The guide may be illustrated with respect to the ten-year-level item 
reproduced in Figure 20. The absurdity, of course, is that the white man
should not be shooting at the distant Indian when he is in immediate danger
from the other two. The manual lists the following as acceptable answers:
“He’s just aiming at one Indian and the others could knock him over the
head.” “He’s shooting at the Indian that’s far away.” There are five such
samples given. Sample unacceptable answers include the following: “There’s
a hunting man, there’s Indians there, they’re going to kill him, he’s standing
there, he ain’t running.” “These are standing right by him letting him shoot
the other fellow.” “There’s a man and he’s dressed too modern to be in times
of Indians - got too modern a gun” (44, p. 254). The principal difficulty in
scoring this item occurs when a child does not explain clearly what he believes
the absurdity to be. Skillful questioning is needed to obtain adequate
elaboration in some cases.

Fig. 20. “What’s foolish about that picture?” A Stanford-Binet item (44).

7. Judge the following answers to the problem in Figure 20 right or wrong. (These
answers are taken from the Pintner et al. “Supplementary guide” [36].)

a. Three Indians are trying to hit this man on head when he tries to shoot them.

b. This Indian (pointing to Indian in distance) trying to fight with white man
(pointing).

c. He hasn’t a straight aim at the Indian; Indian will kill him, he should aim at
first Indians, not back one.

d. Indian has peculiar clothes. I should think man would be shooting at the
nearest Indian.

8. Which of the terms introduced on pages 15 to 17 apply to Form L of the
Stanford-Binet in whole or part?

What the Test Measures

A study of the items in Table 11, or, better, of the entire test, is the best 
way of understanding what a Binet IQ represents. Since most subsequent intelligence
tests have been made to have high correlations with the Stanford-Binet,
most statements about the meaning of “Binet intelligence” apply also
to scores from other tests. 

The reader will probably agree that most of the test items do fit Binet’s 
definition of intelligence, in that they call for ability to maintain a definite 
set, adaptation, and self-criticism. That the items all measure some common 
element, which is named “general intelligence,” is demonstrated by the correlations
between each item and the total test. But a thoroughgoing analysis
must do more than accept items because they include an element we wish 
to measure. An equally important question is: What elements affect the 
score that are not considered in the definition? Logical analysis plus various 
experimental studies have led to several general conclusions on this matter. 

1. The Stanford-Binet test measures present ability, not native capacity.
While it seems obvious that no test can measure anything but here-and-now
behavior, there has been much confusion in the past thirty years because intelligence
was thought of as hereditary. While there is necessarily an inborn
potentiality, the test measures only present ability, which is affected by both
innate factors and experiences. If a user wishes to infer that a difference between
Binet test scores of two children represents an innate difference, he
must assume that the two children have had much the same experience during
their lives. If a child has had the same opportunity as normal children to
acquire skills, information, and work attitudes called for by the tests, any 
failure to come up to normal performance must be due to his having failed 
to profit from his opportunities. 

Binet himself never considered that his tests measured an innate capacity.
While it is clear that for most children rank in intelligence is fairly constant
because their environment is fairly constant, one must not assume that the
IQ corresponds exactly to native capacity. The ability level can of course be
changed by radical changes in environment, although we have found no general
techniques for “mental orthopedics” (Binet’s term) which will accelerate
significantly the mental development of normal children from normal homes.

It is easy to list variations in experience that would make it easier for one
child than another to pass the Binet tasks, even if they had equal native ability.
Johnny was ill during much of the first grade; he is now a poorer reader
than others. Freddie is only five, but his father has played number games
with him so that he can count and add very well. Harold’s mother did not
like having her walls marked, so she refused to let Harold use pencils or
crayons except under her supervision; Harold, at seven, doesn’t seem to enjoy
drawing and is clumsy at it. Sarah lived in a rural area, where she never saw
trains or telephones, although their pictures were in her story books. Peter’s
parents are refugees; although both parents can speak English, they find it
difficult and always use German at home. Frances has a set of children’s puzzle
books which include interesting puzzles, pictures containing absurdities,
and pictures to compare to see if they are alike or different. Variations such
as these are common in any group and may be counted on to raise or depress 
both test performance and school performance. 

9. List for each of these children some of the tests in Form L which they would 
find easier or harder than children with “normal” experiences. 

10. Which items in Table 11 depend upon previous school learning?

11. Which items would offer an advantage to a child from an upper-class home 
compared to a child from an impoverished working-class home? 

2. Stanford-Binet scores are strongly weighted with verbal abilities. The 
great majority of test items call for facility in using and understanding words.
If the child does poorly at these tasks, he probably will do poorly in other
verbal activities. He may do badly on the test because of poor schooling, but
this will also cause him to do badly in school in the future. The Binet test is
an excellent measure of scholastic aptitude, of readiness to do the sort of
tasks required in school. Since Binet originally sought tests which would
identify superior and inferior pupils, as judged by their school performance,
it is not surprising that the final test measures a general ability common in
school work. If one were to examine intelligent acts outside of school, it
might be found that verbal facility is not so widely important. The test is not
a measure of all types of mental ability; various writers have commented on
its underemphasis on insight, foresight, originality, organization of ideas, and
so on. A high score on the test should not be interpreted as guaranteeing 
qualities which the test does not measure. 


Table 12. Mean IQ’s of Monolingual and Bilingual Children on
the Binet Test and a Performance Test (12)

                         Mean IQ for      Mean IQ for
                         Monolinguals     Bilinguals
Test                     (N = 106)        (N = 106)       Difference [a]

Stanford-Binet           98.7             90.9             7.8
Atkins Object-Fitting    89.0             97.5            -8.5

[a] Significance tests show that neither difference could be due to chance.


Among the pupils for whom this verbal loading causes the Stanford-Binet
to give an unfair picture of intelligence as a whole are bilingual children,
children from homes where experiences with English are limited, children
with hearing deficiencies, and children who have had difficulty in reading.
The examiner can often identify such cases by their relatively superior performance
on items not stressing language. Table 12 compares the IQ’s of
children who speak only English with bilinguals who speak a second language
at home. Both groups were tested on the 1937 Binet and the Atkins
Object-Fitting Test, a performance test for preschool children which does
not demand facility in English. It is evident that the bilingual group, which 
showed superiority on the non-verbal test, would be judged inferior on the 
Binet test. 

12. Suppose one were faced with Binet’s original problem, of deciding whether a 
pupil failing in school could profit from the regular curriculum. If the pupil is 
bilingual, which test would serve better? 

3. The Stanford-Binet score represents somewhat different mental abilities
at different ages. This is apparent from the shift of emphasis in Table 11.
Early tests call for judgment, discrimination, and attention. Verbal tests and
reasoning play a smaller part until later years. If all the tests measured general
intelligence equally well, this would be no problem. From the evidence
to date, the simple mental developments of early childhood predict only
roughly the later emergence of verbal and higher mental abilities. While the
early levels of the test are excellent in testing a child to be sure he is making
normal progress, IQ’s at preschool ages do not predict accurately the subject’s
later standing.

The clearest study on this comes from Maurer’s work with the Minnesota 
preschool tests, which are similar to the early Binet levels. Maurer followed 
a group from preschool years to late adolescence, and retested them to determine
what preschool test items had best predicted intellect at maturity.
At maturity she used a group test, which is heavily weighted with verbal materials,
but which is also highly correlated with both Binet scores and school
success. She found that many items which correlated well with the total preschool
scale were poor predictors of later development. Among the poor
items were pointing out parts of the body, obeying simple commands, comprehension,
and paper folding (28).

It is significant that functions tested during the stage when they are just
emerging are poor predictors. That is to say, performance is unstable in the
early portions of a learning curve. Anderson argues that vocabulary is a good
test at later ages just because it is a learning based on a long period of environmental
stimulation (2, p. 21). Maurer confirms this in her summary of
the characteristics of good predictive tests for preschool ages (28, pp. 85-86):
“[Good] tests for younger children make only minimal demands on
language. They require perception of form and spatial relationships and the
ability to reproduce them. They do not demand complex motor coordinations.
They require controlled attention and ability to persist to a goal. Many
of them are comparatively independent of training. Tests for older children
[4-5 years] involve use of language in relationships which are not often practiced
and constitute problem-solving situations involving the use of well-developed
tools. Only two tests (response to pictures and naming colors)
seemed to test skills in the early stages of formation.”

4. The test requires experiences common to the American urban culture,
and is therefore of dubious value for comparing different cultural groups.
The Zuni Indians, for example, have a cooperative society most unlike the
competitive attitudes we tend to encourage. Zuni children have races. But a
child who wins several races is censured, as he has made others lose face.
Instead, he must learn to win some races, to show he is capable, and then to
hold back and give others an opportunity to win. In arithmetic, white teachers
sent Zuni children to the blackboard for arithmetic drills, with instructions
to do a problem, then turn their backs to the board when finished. Instead,
the pupils faced the board until the slowest had finished; then all
turned. This was to them simply courtesy; following the teacher’s direction
would have been exhibitionism. It is easy to see why the typical American
speed test gives misleading results among the Zuni. A Binet test fares no
better; the first subject may fail some items deliberately, because he fears the
next child will be unable to answer. All intelligence tests face the same problem;
they are adequate for comparing persons with similar experience, but
white Americans would perhaps do badly on a test developed by a Zuni
psychologist, using questions which differentiated between good and poor
Zunis.

The influence of culture on performance is especially well shown by an
analysis of scores on the Goodenough Draw-a-Man Test in various Indian
tribes (17). In this test for young children, the child is asked to draw the
best man he can. Children of six Indian tribes were tested. The mean IQ
was computed by Goodenough’s method, which scores the completeness
and realism of the drawing fairly objectively. For one group of Hopis, the
mean IQ for boys was 123; for girls, 102. Similarly, other Hopis and Zias
showed large differences favoring boys. For Navahos, on the other hand,
the mean IQ’s were boys, 107; girls, 110. Anthropologists explained these
differences, pointing out, for example, that the Hopi boy is stimulated to
observe and work with his world, while girls’ activities are limited to
routine household tasks. Zia boys are encouraged to draw from infancy, whereas
girls rarely draw except in school. Boys are expected to paint animals on the
house walls at Christmas to encourage fertility, and to otherwise aid in
ceremonial painting. Girls paint only conventional pottery designs. In Navaho
tribes, however, boys and girls have similar experiences and opportunities
to learn art. The Draw-a-Man showed correlations with the Arthur
Performance Scale ranging as low as .10 and .20, even though both are
regarded as tests of intelligence in the white culture. Obviously, a test is
indicative of innate aptitude only when all persons compared have similar
backgrounds.

5. The Binet test does not give a reliable measure of separate aspects of
mentality. Scores, however, are heavily influenced by abilities other than
general intelligence. It would be helpful if we could divide the Binet test
into segments and obtain separate estimates of verbal ability, information,
and so on. McNemar’s factor analysis (27) effectively shows that this is not
possible. He finds that for the average Binet item, general intelligence
accounts for about half of what the item measures. The extent to which
any item measures “general intelligence” is shown by the correlation
between the test and the total score (Table 11). The other half of the
variation between individuals is accounted for by miscellaneous knowledges and
abilities which are irrelevant to the purpose of the test. Any particular
specific factor is rarely found in more than one or two items; therefore it is
not practical to identify these factors in separate scores. Since Binet and
Terman deliberately sought a great variety of tests, so that no one
subdivision of intelligence would have great weight in the final score, the
Binet tests are necessarily unsuitable for diagnosing separately the various
aspects of ability.

13. What sort of items have the greatest correlation with total test score at levels 
II-6 and IV? What does this suggest regarding the meaning of “general intel- 
ligence” in these scales? 

14. What items have the greatest loading of “general intelligence” at the upper
end of the scale? What is the meaning of “general ability” at that level?

15. Items having low correlations with total score are naming objects (II-6),
repeating digits (X), and picture absurdities (X). What factors other than
“general mental development” might influence these items?

16. The Stanford-Binet test has been criticized because it contains numerous items 
relating to death and other morbid subjects. What has this to do with the value 
of an intelligence test, so long as brighter pupils pass these items? 

6. The Binet scale is influenced by the subject’s personality and emotional
habits. In fact, Binet’s description of intelligence involves persistence,
flexibility of mental approach, and criticalness, all of which are regarded as
important in clinical concepts of personality. Among the emotional habits
which have an obvious effect on scores are shyness with strange adults,
lack of self-confidence, and dislike for “schoolish” tasks. A self-critical
person may say “I don’t know” because he is dissatisfied with the best answer
he can formulate; a person less sensitive to niceties may give an answer
which is passable. A pedantic urge to accuracy may make it relatively easy
to perform well on memory tasks. Inhibition and fear of being incorrect may
cause failure on tasks requiring insight and imagination. No matter how
careful a tester is, there is some danger that a child may fail an item even
though he could have passed. One should therefore always bear in mind
that the final test score shows how well the child functioned in comparison
with others, in his present state, which may be markedly affected by
emotional complications.

17. In indicating the importance of objective mental testing, Terman says, “I be- 
lieve it is possible for the psychologist to submit, after a forty-minute diagnos- 
tication, a more reliable and more enlightening estimate of the child’s intelli- 
gence than most teachers can offer after a year of daily contact in the class- 
room” (40, p. 204). In which of the following features does the advantage of 
the test over the teacher’s report lie?

a. Freedom from personal prejudice.

b. Considering more aspects of intelligence.

c. Considering a basically different trait.

d. Observing capacity rather than level of actual performance.

e. Sampling behavior under a wide range of conditions.

f. Permitting an exact comparison from child to child. 

g. Permitting an exact comparison of the child with a standard of normality. 

Scoring System 

Binet’s plan of successive hurdles makes it possible to report mental
development in a simple and easily comprehended score called the mental age.
The mental age corresponds to the level on the scale at which the child
can just pass the tests. Computation of mental age would be simple if the
child passed all tests to a certain level and failed all tests after that level.
Because the failures enter gradually, the mental age is determined by adding
credits (usually two months per test) for tests passed at any level. Where
test levels are two years apart, and six tests compose each level, each test
counts four months; where levels are six months apart, each test counts one
month. The total credit in months is converted into a mental age in years and
months (written thus: 5-8 for five years, eight months).

Table 13. Binet Performance of Six Children 



In Table 13, records of six children are reported. Frank shows very uniform
ability; when he begins to fail, he fails nearly all tests at that level. His
basal age is six. Four tests passed at Year VII add eight months’ credit;
one at VIII adds two months. His mental age (MA) is 6 years, 10 months.
Billy has greater “scatter.” His MA is figured as follows:


Level        Tests passed           Credit
Basal age                           6 yrs.
VII          4 tests, 2 mo. each    8 mo.
VIII         4 tests, 2 mo. each    8 mo.
IX           2 tests, 2 mo. each    4 mo.
X            3 tests, 2 mo. each    6 mo.
Total                               6 yrs., 26 mo.; 8 yrs., 2 mo. = MA
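
The crediting rule can be sketched in a few lines of Python. This is only an illustration of the arithmetic described in the text; the function names and the dictionary layout are assumptions made for the example, and it covers only levels where each test counts two months.

```python
def mental_age_months(basal_age_years, tests_passed, months_per_test=2):
    """Mental age in months: basal age plus credit for every test passed above it.

    tests_passed maps a year level (e.g. "VII") to the number of tests passed there.
    months_per_test defaults to 2, the usual credit mentioned in the text.
    """
    credit = sum(tests_passed.values()) * months_per_test
    return basal_age_years * 12 + credit


def format_age(total_months):
    """Write an age in the text's notation, e.g. 82 months -> '6-10'."""
    years, months = divmod(total_months, 12)
    return f"{years}-{months}"


# Frank: basal age 6, four tests at VII, one at VIII.
frank = mental_age_months(6, {"VII": 4, "VIII": 1})
# Billy: basal age 6, then 4, 4, 2, and 3 tests at VII through X.
billy = mental_age_months(6, {"VII": 4, "VIII": 4, "IX": 2, "X": 3})

print(format_age(frank))  # 6-10
print(format_age(billy))  # 8-2
```

Both results agree with the mental ages worked out in the text for Frank and Billy.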
While mental age measures the pupil’s performance, a measure of his
brightness in relation to normal expectation for his age is helpful. Following
a suggestion by Stern, Terman applied an intelligence quotient, or IQ, in
his 1916 revision. This quotient is based on the definition of mental age. The
average seven-year-old has a mental age of 7-0. A seven-year-old with a
higher MA is superior, and one with a lower MA is inferior in mental
development. The chronological age (CA) of the child is his age from birth. The
IQ is the ratio of MA to CA, thus:


$$\mathrm{IQ} = \frac{\text{Mental age}}{\text{Chronological age}} \times 100 = \frac{\mathrm{MA}}{\mathrm{CA}} \times 100.$$


For Frank, the CA is 6 years, 4 months, or 6.33 years. His MA is 6 years,
10 months, or 6.83 years. His IQ is 683 (6 10/12 × 100) divided by 6.33, or
108. The IQ of the normal child is of course 100.
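
The same computation can be written as a short Python sketch. Working in months avoids rounding the ages to two decimals first; the function name `ratio_iq` is an assumption made for this example.

```python
def ratio_iq(ma_months, ca_months):
    """Ratio IQ: mental age divided by chronological age, times 100, rounded."""
    return round(100 * ma_months / ca_months)


# Frank: MA 6-10 (82 months), CA 6-4 (76 months).
print(ratio_iq(82, 76))  # 108
# A child whose MA equals his CA has an IQ of 100.
print(ratio_iq(84, 84))  # 100
```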

A special problem is presented in testing older adolescents and adults.
The “developmental” idea of mental age is difficult to apply, since
performance ceases to improve regularly as maturity is reached. There is almost
no difference in test performance between 15-year-olds, 18-year-olds, and
21-year-olds, unless the test calls for items which added experience helps
greatly in answering. Studies have shown that ability to adapt to new
situations has reached nearly full development at age 16. By a series of
interlocking definitions, the mental age and intelligence quotient of an older
subject are determined as follows: If the true CA is greater than 13-0
(where gradual slowing of development begins), a correction is applied;
the corrected CA is lower than the true CA. The mental age of the average
adult is set at 15-0. It is impossible to define higher mental ages, attained
by superior adults, as “the level attained by the average person at age ___,”
because the average person never rises to these levels. Mental ages above
15 are arbitrary units designed to keep the distribution of intelligence
quotients roughly constant into adulthood. The mental age for the older
subject is found by adding to 14 (or lower), as a basal age, whatever credits
he earns in the adult-level tests. This mental age is divided by the corrected
chronological age; for persons over 16, the corrected CA is 15.
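
A minimal Python sketch of the corrected-CA rule follows. The text fixes only two points: no correction up to 13-0, and a corrected CA of 15 for persons over 16. The rule used here between 13-0 and 16-0 (counting two of every three months past 13-0) is an assumption chosen because it joins those two points smoothly; the text does not state it. The function names and the 30-year-old subject are hypothetical.

```python
def corrected_ca_months(ca_months):
    """Corrected chronological age in months.

    Up to 13-0: no correction. From 16-0 on: fixed at 15-0 (both from the text).
    Between 13-0 and 16-0: ASSUMED rule, two of every three months past 13-0 count.
    """
    if ca_months <= 13 * 12:
        return ca_months
    if ca_months >= 16 * 12:
        return 15 * 12
    return 13 * 12 + (ca_months - 13 * 12) * 2 / 3


def adult_iq(ma_months, ca_months):
    """Ratio IQ using the corrected CA as the divisor."""
    return round(100 * ma_months / corrected_ca_months(ca_months))


# Hypothetical 30-year-old with MA 16-6 (198 months): the divisor is 15-0.
print(adult_iq(198, 30 * 12))  # 110
# The average adult, MA 15-0, earns IQ 100 at any adult age.
print(adult_iq(180, 40 * 12))  # 100
```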


18. Compute the mental ages for the remaining four children in Table 13.

19. Compute the IQ’s for the remaining five children in Table 13.

20. A 20-year-old passes the following tests: XIV, all; Av. Adult, 7 tests out of 8,
credit 2 months each; Superior Adult I, 2 tests out of 6, credit 4 months each;
Superior Adult II, 1 test out of 6, credit 5 months each. Find his IQ.

21. Find the IQ of a 40-year-old who passes the same tests as the subject in
Question 20.

22. The statement, “The mental age of the average American adult is that of a 15- 
year-old” is frequently used in criticizing the quality of radio programs, the 
level of our civilization, or our political life. Discuss the validity of the argu- 
ment. What practical implications may be derived from the facts about adult 
mental age?

CHARACTERISTICS OF THE IQ

IQ as a Measure of Rate

The intelligence quotient is a convenient way of summarizing performance
from the test. It is particularly helpful because mental age is constantly
rising during the school years, whereas the IQ is more or less constant. While
we shall examine the assumptions about IQ constancy in detail a few pages
on, let us assume, temporarily, that the IQ of a child is actually fixed. Frank,
whose age is 6-4 according to Table 13, is at present six months ahead of
average test performance for his age; when he is 12 years old he will have
increased his advantage proportionally. In the absence of further basis for
prediction, we may assume that his IQ at that time will still be 108. His
MA, when CA is 12, can be estimated by substituting in the equation
$\mathrm{IQ} = \frac{\mathrm{MA}}{\mathrm{CA}} \times 100$. Then
$108 = \frac{\mathrm{MA}}{12} \times 100$, and the estimated MA is 12.96, or
13 years.
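
Solving the IQ equation for MA gives MA = IQ × CA / 100. The sketch below applies it to Frank under the text's temporary assumption that the IQ stays fixed; the function name is an assumption made for the example.

```python
def projected_ma_years(iq, ca_years):
    """Estimated mental age in years at a given CA, assuming the IQ stays constant."""
    return iq * ca_years / 100


# Frank, IQ 108, projected to CA 12.
print(round(projected_ma_years(108, 12), 2))  # 12.96
```

The projection is only as good as the assumption of constancy, which the later section on stability of scores qualifies.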

The IQ is useful in comparing children of different ages. Gain or loss,
relative to the average, is shown in change of IQ. Generalizations can be
established about the educational and vocational expectancies of children
of different IQ’s. Because it is easy to use, the IQ is often the only measure
reported from the test. Users must remember that persons with the same
IQ do not have the same present mental ability; they have only similar
relative superiority. In any class, under-age children of superior ability are
more like average children than they are like children of normal age with
high IQ’s. Only MA shows the present level of ability of the person. Two
children of the same MA have the same general level of development, but
they will differ in the direction of development. Table 14 shows the subtest
performances of children with the same MA. Young, superior children pass
different tests at Year IX than do older subnormal children.


Table 14. Relative Difficulty of Tests in 1916 Stanford-Binet for Children
All Having an MA of 9 But Different IQ’s (29)

                                            Below IQ 70;    IQ 90-110;   Above IQ 140;
                                            CA 13 or Over   CA About 9   CA 6-6 or Below
Year IX                                     (N = 48)        (N = 48)     (N = 22)

1. What is the date today?                      71              63           26
2. Tells which of two weights is heavier.       47              47           91
3. Makes change.                                83              67           31
4. Repeats four digits backward.                58              67           54
5. Uses three given words in a sentence.        83              65           68
6. Finds rhymes for day, mill, etc.             59              58           91


23. Prepare a table or graph showing the probable MA’s of the six children in
Table 13 at the following ages: 4, 0, 11, 13.

24. Judging from Table 14, how does the mind of a child of IQ 150, CA 6, differ
from that of the average nine-year-old?

Distribution of IQ's 

Figure 21 shows the distribution of scores for a large group of persons
from ages 2 to 18, used in standardizing the test. Scores approximate a
normal distribution. The distribution centers around IQ 100, and deviate
IQ’s are about equally common in the superior and inferior directions. The
frequency of IQ’s at each level is shown in Table 15. This table is the best
single frame of reference for deciding how high, or how low, a given score
is in relation to the population as a whole.

Fig. 21. Distribution of IQ’s in the Terman-Merrill standardization group (44).
[Figure not reproduced; horizontal axis: I.Q., from 44 to 174.]
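
Used as a frame of reference, Table 15 amounts to a lookup: find the interval containing a score and read off the percent of cases in and above it. A minimal Python sketch, using the cumulative column as printed in Table 15, is given below. The function name and the list layout are assumptions made for the example; note that the value returned refers to the whole interval containing the score, not to the exact score.

```python
# (lower bound of IQ interval, percent of cases falling in and above that interval),
# copied from the cumulative column of Table 15. "Below 50" is given lower bound 0.
TABLE_15 = [
    (150, 0.2), (140, 1.3), (130, 4.4), (120, 12.6), (110, 30.7), (100, 54.3),
    (90, 77.3), (80, 91.8), (70, 97.4), (60, 99.4), (50, 99.8), (0, 100.0),
]


def percent_in_or_above_interval(iq):
    """Percent of the standardization group in or above the interval containing iq."""
    for lower_bound, cumulative_percent in TABLE_15:
        if iq >= lower_bound:
            return cumulative_percent
    raise ValueError("IQ must not be negative")


print(percent_in_or_above_interval(125))  # 12.6
print(percent_in_or_above_interval(85))   # 91.8
```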

Table 15. Percentage Distribution of IQ’s in Terman-Merrill
Standardization Group (30)

                Percent     Percent of Cases Falling in
IQ              of Cases    and Above Each Interval
150+              0.2           0.2
140-149           1.1           1.3
130-139           3.1           4.4
120-129           8.2          12.6
110-119          18.1          30.7
100-109          23.5          54.3
90-99            23.0          77.3
80-89            14.5          91.8
70-79             5.6          97.4
60-69             2.0          99.4
50-59             0.4          99.8
Below 50          0.2         100.0

Comparison of cases with a normal group is frequently helpful. In practical
situations, however, we must usually judge how a person will fit into a
selected group. Selection begins by the time the child enters school, for
many of the severely subnormal are institutionalized or cared for in the
home. Through the grades a slow but continuous elimination process operates,
especially where children are permitted to leave school to work. There
is less incentive for the superior child to leave school than for the child who
is frustrated in school work. The end result is a gradual raising of the average
level. Elimination is most severe at such breaking points as the beginning
of high school or high-school graduation. The IQ range in high school is
unlike the representative sample studied by Terman and Merrill, and college
groups are even more highly selected. One of the many studies of the range
of IQ’s in high school showed the distribution in Table 16. Because a test
other than the Binet was used, the norms are not precisely comparable to
those of Table 15. The median IQ was 106.8, and the dropping off of cases
below 100 IQ is notable. Other evidence of selective elimination is provided
by Figure 22, which compares the range of IQ’s of 1146 pupils entering
high school with those dropping during the course.

Fig. 22. IQ’s of all pupils entering one high school compared with IQ’s of
those dropping out before graduation (31). [Figure not reproduced.]

Table 16. Distribution of Freshman IQ’s in the High Schools of St. Louis,
Missouri, in 1936 (23)

Henmon-Nelson    Percent of
IQ               58,341 Pupils
140+                 1
130-139              4
120-129             12
110-119             24
100-109             31
90-99               20
80-89                7
Below 80             1

Since the range of abilities varies from school to school and from class to 
class, a final judgment of the pupil’s standing must be based on local norms. 
Local norms change from time to time owing to population migration and 
changing school policies. At the college level, Traxler’s data provide an 
important warning against overgeneralizing in interpreting an IQ. He 
studied data from 323 colleges, where the American Council test was given 
to all freshmen. After converting these scores to IQ’s roughly similar to 
Binet IQ’s, he obtained the results given in Table 17. Clearly, some students 
who would succeed readily in one college will be far below average if they 
unwisely select a college where the competition is stiffer (cf. Fig. 17). 


Table 17. Median IQ’s of Freshmen in American Colleges (47)

                                  Median IQ
Highest of 323 colleges              123
Median of all colleges               109
Median of four-year colleges         109
Median of junior colleges            105
Median of teachers colleges          105
Lowest of 323 colleges                94


25. The mental test scores of high-school students have, on the average, increased
during the past twenty years, even though a constantly increasing proportion
of the population has attended high school (14). How can this increase be
explained?

26. Sketch a chart comparing the data from Tables 15 and 16, and discuss the ex- 
tent to which the high-school group differs from a normal population. 

Meaning of Particular IQ’s 

A frequent device is to label certain groups as “normal,” “near genius,”
“feeble-minded,” etc. This is fallacious, because there is no borderline at
which genius, for example, suddenly appears. Some persons of IQ 110 make
significant contributions, and some of IQ 160 lead undistinguished adult
lives. Some adults of IQ 80 are incapable of adjustment to the world, and
some of IQ 60 support themselves and make an adequate home. Some
classifications of feeble-mindedness have been widely accepted as a starting
point for thinking about the individual case. IQ’s 60-69 may be labeled
borderline; 40-59, morons; 20-39, imbeciles; and below 20, idiots. These
are convenient, but it is wrong to think of them as rigid pigeonholes. Clinical
disposition of a case is always based on a combination of mental test data
with evidence on the person’s functioning in social and practical situations.

27. Compute the mental ages of six-year-old children who are classified as idiots,
imbeciles, and morons. What does this imply regarding the behavior to be
expected from them?

28. Compute the mental ages of adults in each of these categories. Interpret the
results in terms of the amount of care these adults will need.

29. In early psychological reports, a child having IQ 75 was sometimes referred
to as having “three-fourths of normal intelligence.” Why is this misleading?

30. Freeman (15) lists the imbecile range as 25-50, instead of 20-39 as reported
above, based on Bernreuter and Carr’s classification (4). Is there a way to
decide which is correct?

An interesting basis for studying the human meaning of the high IQ is in
the research by Catharine Cox Miles, who estimated the IQ of prominent
historical figures from biographical data indicating their childhood
development. “Voltaire wrote verses from his cradle; Coleridge at 3 could
read a chapter from the Bible. Mozart composed a minuet at 5; Goethe,
at 8, produced literary work of adult superiority” (11, p. 217). The minimum
estimated IQ’s which could account for the recorded facts about these men
were: Voltaire, 180; Coleridge, 175; Mozart, 160; Goethe, 190. The true IQ
might conceivably have been higher, but full evidence was not available.

Table 18. Reference Points for Establishing the Meaning of an IQ

IQ    Reference point
120   Needed to do acceptable work in a first-class college with normal
      effort (16, p. 261).
114   Mean IQ of children in Midwest city, from white-collar, skilled-labor
      families (18).
107   Mean IQ of high-school seniors (14).
104   Minimum IQ for satisfactory (i.e., average) work in high school, in
      academic curriculum (10).
100   Average IQ in unselected population (theoretical).
 93   Median IQ of children in eight one-teacher rural schools in Texas (5).
 91   Mean IQ of children in Midwest city, from low-income, socially
      depressed homes (18).
 90   Adult of IQ 90 can assemble some parts requiring some judgment, can
      operate sewing machine where threading and adjusting machine is
      required (9).
      Child of IQ 90 can progress through eight grades with some retardation.
      With persistence may complete high school with difficulty (10, 42).
 70   Adult of IQ 70 can set and sort type, do farm work (3).
      Child of IQ 70 will be able to attain fifth grade and may do average
      work there (42).
 60   Adult of IQ 60 can repair furniture, paint toys, harvest vegetables (3).
 50   Adult can do rough painting, simple carpentry, domestic work (3).
      Child above IQ 50 can profit from special classes in regular schools;
      need not be segregated (13).
 40   Adult can mow lawn, handle freight, simple laundry work (3).

This is perhaps a good place to comment on the reduction of IQ that
occurs for unusually superior persons as they mature. It is mathematically
possible for a six-year-old to reach an IQ well over 200, if he can pass tests
through the 12-year level. When the same boy is 14, he may pass every item
in the entire test, but his MA cannot exceed 22 years, 10 months. This
“ceiling effect” makes the highest possible adult IQ 152. Chance errors will
usually reduce the scores of even the highest adults somewhat below this.
As a highly superior person reaches adulthood, then, his IQ declines, not
because of less rapid development, but because of limits imposed by the
test. Lest this be interpreted as suggesting that superior children lose their
real superiority, attention may be drawn to Terman’s studies in which he
followed children of high IQ through early adulthood. Considering the
group as a whole, these young adults were found to be in every way superior
to average men and women of comparable age (43).
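
The ceiling figures can be checked with a short Python sketch. The credits for Average Adult, Superior Adult I, and Superior Adult II are those given in Question 20; the fourth level (here called Superior Adult III, six tests at six months each) is an assumption, not described in the text, chosen because it brings the total to the stated maximum of 22 years, 10 months. The corrected CA of 15 for adults is taken from the scoring rules above.

```python
BASAL_YEARS = 14  # basal age assumed when every test at Year XIV is passed

# (level, number of tests, months of credit per test)
ADULT_LEVELS = [
    ("Average Adult", 8, 2),
    ("Superior Adult I", 6, 4),
    ("Superior Adult II", 6, 5),
    ("Superior Adult III", 6, 6),  # ASSUMED: this level is not described in the text
]

# Highest attainable MA: basal age plus full credit at every adult level.
max_ma_months = BASAL_YEARS * 12 + sum(n * credit for _, n, credit in ADULT_LEVELS)
years, months = divmod(max_ma_months, 12)

# Highest adult IQ: maximum MA divided by the corrected CA of 15 years.
max_adult_iq = round(100 * max_ma_months / (15 * 12))

print(f"{years} years, {months} months")  # 22 years, 10 months
print(max_adult_iq)                       # 152
```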

The meaning of the IQ is best understood by one who has great experience 
in observing children of known IQ in particular situations. A partial substi- 
tute for such a background may be gathered from the research literature, 
where various writers have established the IQ requirements of particular 
tasks. Many of these results are brought together in Table 18. These
standards are not entirely trustworthy, since they are based on different
tests and on different criteria. They are nevertheless worth study, since they
challenge many preconceptions about intelligence.

31. If an IQ of 104 is demanded by the academic high-school curriculum, do such 
schools provide a suitable education for the majority of children? 

32. What problems may be anticipated if economic trends cause most above- 
average high-school graduates to seek a college education? 

33. The elementary school and the junior high school have been modified in the
last 30 years, in the direction of a curriculum promoting rounded development
in artistic, physical, and social skills, in contrast to the curriculum stressing
memorization and skill in reading and arithmetic which was in effect when
Terman (42) made his analysis of the educational expectancy of children with
IQ 90 and IQ 70 (Table 18). Would this change in curriculum alter the
expected educational progress of such children?

The Equivalence of IQ’s 

Although we have so far assumed that an IQ of a given size always
represents equally high standing, it is now necessary to consider two
limitations to this assumption. The first is that to some extent, Stanford-Binet
IQ’s are not comparable at different ages; the second, that IQ’s from different
tests are not perfectly interchangeable.

As we saw in Chapter 3, the most meaningful way to interpret a score is
to study its relative position in the group. When the spread of IQ’s at
different ages is determined for the Stanford-Binet, it is found that there is
some variation. On the whole, the standard deviation of the IQ is 16 or 17
points, but at a few ages Terman and Merrill found much higher or lower
sigmas. In the standardization group, the standard deviation dropped to 14
at age 5, and 12.5 at age 6; it rose to 19 or 20 at ages 2½, 3, 12, and 15 (44,
p. 40). These irregular fluctuations are probably due to peculiarities of the
scale (27, p. 85). If a score two standard deviations below the mean
represents the same degree of mental retardation at all ages, an IQ of 60 at age
3, 12, or 15 is equivalent to an IQ of 72 to 75 around age 6, and equivalent
to an IQ around 67 at most other points on the scale. Further research on
this problem is needed.
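
The equivalences in this paragraph follow from holding the standard score constant while the standard deviation changes. A minimal Python sketch is given below; the function name is an assumption, the mean is taken as 100 at every age, and 16.5 is used as a stand-in for the "16 or 17 points" typical of most ages.

```python
def equivalent_iq(iq, sd_from, sd_to, mean=100):
    """IQ at another age having the same standard score (distance from the mean in SD units)."""
    z = (iq - mean) / sd_from
    return mean + z * sd_to


# IQ 60 where the SD is 20 lies two standard deviations below the mean.
print(equivalent_iq(60, 20, 12.5))  # 75.0  (age 6, SD 12.5)
print(equivalent_iq(60, 20, 14))    # 72.0  (age 5, SD 14)
print(equivalent_iq(60, 20, 16.5))  # 67.0  (most other ages, SD taken as 16.5)
```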

When we have learned what to expect of a child whose Stanford-Binet IQ
is 80, we tend to transfer that expectation to another child for whom we have
a Wechsler or Otis IQ of 80. This is not justified because different tests are
standardized on different groups and IQ’s are often computed by formulas
different from the Binet procedure. Sometimes these differences are
noteworthy. One group of college freshmen, for example, had a median Binet
IQ (Form L) of 129, but on the Wechsler test their median was only 119 (1).

Error of Measurement 

Naturally, no test score is completely dependable. The extent to which
the IQ is affected by errors of measurement is shown by reliability
coefficients. When the precision of the IQ is determined by comparing Form L
and Form M IQ’s obtained a few days apart, a correlation of about .91 is
obtained (44, p. 47). This establishes the Stanford-Binet as one of the most
reliable of all tests. Even so, the average shift of IQ’s from form to form
is substantial: 5.9 for IQ 130, 5.1 for IQ 100, 2.5 for IQ’s below 70. This
means that an IQ of 180 may be 12 to 14 points from an estimate made for
the same child a few days later, although such errors are infrequent. The
best method for visualizing the error of measurement is to study the scatter
diagram for seven-year-olds, reproduced in Figure 23. We see, first, that
the test is more precise for low IQ’s; changes below IQ 80 are slight. Next,
we notice a few marked shifts, despite the general agreement between
the two measures. For those with IQ 95-99 on Form L, the Form M
estimates range from about 87 to about 112.

Fig. 23. IQ’s obtained by seven-year-olds when tested on both forms of the
Stanford-Binet (44). [Scatter diagram not reproduced; one axis is labeled
M I.Q.]
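
A reliability coefficient can be translated into IQ points with two standard formulas from test theory, neither of which is given in the text: the standard error of measurement, SD × sqrt(1 − r), and the expected absolute difference between two parallel forms. The Python sketch below rests on several assumptions: an SD of 16.5 (standing in for the "16 or 17 points" reported for most ages), equal SDs on both forms, and normally distributed errors. The function names are hypothetical.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Typical distance between an earned score and the true score."""
    return sd * math.sqrt(1 - reliability)


def expected_absolute_shift(sd, reliability):
    """Average absolute difference between two parallel forms, assuming normal errors."""
    sd_of_difference = sd * math.sqrt(2 * (1 - reliability))
    return sd_of_difference * math.sqrt(2 / math.pi)


sem = standard_error_of_measurement(16.5, 0.91)
shift = expected_absolute_shift(16.5, 0.91)
print(round(sem, 2))    # about 5 IQ points
print(round(shift, 2))  # between 5 and 6 IQ points
```

Under these assumptions the expected shift comes out between 5 and 6 points, the same order as the average shifts reported above for IQ 100 and IQ 130. The formulas assume equal precision at all levels, so they do not reproduce the smaller shifts the text reports for low IQ’s.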

34. What range of Form M IQ’s is found for children earning 130-134 on Form L?

35. There are 17 cases having Form L IQ’s of 125 and above. Their median IQ on 
Form L is about 134. How many of them earned a higher score on Form M? 
How many shifted to a lower class-interval? 

36. Twenty-one children had IQ’s between 95 and 99 on Form L. What interpre- 
tation and recommendations would be made for them on the basis of this? 
Would the interpretation for any child be changed if his Form M IQ were 
used instead? 

37. What is the largest change of IQ in the chart? 

Figure 23 provides an illustration of the concept “regression toward the
mean.” A person’s earned score on a test is his true score, plus or minus
some chance error. The true score is the average we would get if we tested
him over and over, ironing out the unpredictable variations. Of course
“good luck” makes the raw score higher than the true score, and “bad luck”
makes the raw score lower. Then if we find a person with a very high score,
more likely than not part of that score is due to good luck. If we test him
again, we can expect him to slip toward his true score, since chance rarely
favors the same person continually. The general rule for regression is:
scores above average will more often decline than increase on retesting;
scores below average will tend to increase on retesting. When a person is
found to have an IQ of 150, then, his “true” IQ is likely to be a few points
lower. How far scores shift on retesting depends on the reliability
coefficient.
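
One common way to put a number on this rule is the standard regression estimate of the true score, mean + r × (earned score − mean). The formula is not given in the text; the Python sketch below applies it with the reliability of .91 reported earlier and a population mean of 100, both used here as assumptions for illustration.

```python
def estimated_true_score(earned, reliability, mean=100):
    """Regression estimate of the true score: the earned score pulled toward the mean."""
    return mean + reliability * (earned - mean)


# An earned IQ of 150 with reliability .91 regresses a few points toward 100.
print(round(estimated_true_score(150, 0.91), 1))
# A below-average score is pulled upward by the same rule.
print(round(estimated_true_score(70, 0.91), 1))
```

With a perfectly reliable test (r = 1) the estimate equals the earned score; the lower the reliability, the further the estimate moves toward the mean.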

Stability of Scores 

A large number of studies have been made, especially with the 1916 test,
to determine whether the IQ is “constant.” If the IQ is fairly stable,
educational and vocational courses can be plotted long in advance, but if the IQ
changes from year to year, such long-range decisions will be undependable.
Evidence from numerous studies with the Binet tests leads to the following
general conclusions:

1. IQ’s are generally stable, but the correlation is far from perfect.

2. Tests at early ages give less accurate predictions than tests given after
age six.

3. Predictions are somewhat less accurate over very long time intervals
than over shorter periods. R. L. Thorndike reviewed 36 correlations, and
concluded that there was a regular trend. For immediate retests, r = .89;
with a 20-month interval, r = .84; after five years, the correlation dropped
to .70 (46).

Figure 24 shows what this instability means for individual cases. It
demonstrates the change in scores found upon retesting of behavior-problem
children with the 1916 revision. The average time between testings was 15
months. Brown comments on these data as follows (7, p. 348):

“Although the correlation between the ratings on one test and another on the
Stanford-Binet is high, a large number of cases make considerable change, and
from the clinical point of view these are often the important cases. One hundred
eight cases or 15.2 per cent change eleven points or more. To say that the average
change is about five points does not help a great deal, because in dealing with
clinical cases one can never be sure that the particular case under observation
may not be one that will show a large amount of change. It would seem advisable
therefore to secure at least two ratings wherever an intelligence rating is
especially important in disposing of the case or in making recommendations.”

Another study of similar children over an even longer time found that 3
percent changed more than 30 IQ points and 10 percent changed 21 to 30
points (8).



Fig. 24. Changes in IQ when 1916 Stanford-Binet is repeated after an average
interval of 15 months (7). [Figure not reproduced; horizontal axis: change of
I.Q. from first to second test, from decrease to increase.]

Some shifts in IQ are due to errors of measurement, and some are due to
major changes in the child himself, resulting from physical disorders or
recoveries, changed emotional adjustment, or a more stimulating environment.
There is also evidence that the rate of mental development undergoes some
change as the individual matures, even when external conditions seem fairly
constant. For this reason, it is unsound practice to rely on mental tests given
several years previously. Extreme reversals, from IQ 70 to IQ 120, are rare,
but shifts of 20 IQ points are found for a few children in most large groups.
Scores of emotionally disturbed or uncooperative children are especially
untrustworthy. When the maladjustment is more than temporary, the child’s IQ
and his general performance will also be constant at a level below his
capacity. But if the causes of emotional disturbance are remedied, drastic
changes in IQ occur. Long-range planning on the basis of the IQ is justified
so long as two cautions are observed: interpretation must consider the
elements in the child’s background which would tend to raise or lower scores,
and all judgments must be made tentatively, leaving the way open for a
change of plans when evidence of change in development appears.

The case of Danny reported by Lowell, from which the following account 
is condensed, should make clear the hazards that await the psychologist who 
treats every IQ as immutable (26).

Danny R. was born January 15, 1929. He entered kindergarten at the age 
of five years and was found to be such a misfit in kindergarten that after a 
few weeks he was given a Binet test. The following are records of the four 
tests given before the end of Grade VI, with the date of test, mental age,
and IQ. 

 2-2-34    MA  4 yrs., 2 mo.    IQ  82
 5-9-35    MA  6 yrs., 2 mo.    IQ  98
 5-8-37    MA  9 yrs., 4 mo.    IQ 111
12-3-40    MA 15 yrs., 9 mo.    IQ 132

The first Binet showed such mental immaturity that the child was excluded 
from kindergarten for a year. The next year Danny moved into another 
school district. This time, on a Binet, he seemed normal, so he was placed in
the first grade in September in spite of his lack of social adjustment. The 
teachers complained that Danny seemed to live in a little world of his own, 
was noticeably poor in motor coordination, and had a worried look on his 
face most of the time. The mother was called in and only then was light 
thrown on his peculiarities. 

The mother explained that while Danny was still a baby his father had 
developed encephalitis. In order for the mother to work, they lived in the 
grandparents’ home, where Danny could be cared for. Danny’s grandfather 
was a high-strung, nervous old gentleman who was much annoyed by the 
child’s noise and expostulated so violently at times that Danny became petrified
with fear. The grandfather’s chief aim was to keep things quiet and
peaceful at any cost. When Danny was excluded from kindergarten the 
mother took him from the grandparents’ home. 

The next few years were a period of educational, social, and emotional 
growth for the starved child. He amazed his teachers with his achievement. 
He became an inveterate reader and could solve arithmetic problems far beyond
his grade level. He was under a doctor’s care much of the time and
was also treated by a psychiatrist because of his marked fears. He made 
friends with boys in spite of lack of physical prowess. 
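
The IQs in Danny's record can be checked against the ratio definition used with the 1916 and 1937 scales, IQ = 100 × MA / CA. The Python sketch below is offered only as an illustration. It assumes the test dates are written month-day-year, approximates chronological age in months from calendar days, and uses names (`ratio_iq`, `age_in_months`, `DANNY_RECORDS`) that do not come from the source. Because the examiners' conventions for rounding age are not stated, the recomputed values should be expected to agree with the reported ones only to within a point or so.

```python
from datetime import date

BIRTH = date(1929, 1, 15)  # Danny's birth date as given in the case report

# (test date, mental age as (years, months), IQ reported in the text).
# Assumption: the dates in the record are month-day-year.
DANNY_RECORDS = [
    (date(1934, 2, 2), (4, 2), 82),
    (date(1935, 5, 9), (6, 2), 98),
    (date(1937, 5, 8), (9, 4), 111),
    (date(1940, 12, 3), (15, 9), 132),
]

DAYS_PER_MONTH = 365.25 / 12  # average month length, an approximation


def age_in_months(birth, when):
    """Approximate chronological age in months from elapsed calendar days."""
    return (when - birth).days / DAYS_PER_MONTH


def ratio_iq(ma_months, ca_months):
    """Ratio IQ: 100 times mental age divided by chronological age."""
    return 100.0 * ma_months / ca_months


def check_records(birth=BIRTH, records=DANNY_RECORDS):
    """Return (reported IQ, recomputed IQ) pairs, one per test."""
    pairs = []
    for when, (years, months), reported in records:
        ma = 12 * years + months
        ca = age_in_months(birth, when)
        pairs.append((reported, round(ratio_iq(ma, ca), 1)))
    return pairs


if __name__ == "__main__":
    for reported, recomputed in check_records():
        print(f"reported IQ {reported:3d}   recomputed {recomputed:5.1f}")
```

The rise of some fifty points across the four tests is far larger than any disagreement that rounding of ages could introduce.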

Except for cases of specific physical or psychiatric defects, it was once
thought virtually hopeless to try to raise the IQ. In recent years, more and
more studies have provided evidence that radically improving environments
will raise the IQ in some cases. Not all these studies have been soundly
conducted, but the general conclusion is certainly warranted that the IQ can
often be changed if we try. When the IQ, i.e., present test performance, is
changed, the person is presumably making better use of his capacity. No one
claims that capacity itself is altered. The most striking recent claim is that of
Schmidt, who reports gains in average IQ from 52 to 89 for children given
special treatment (37). Many psychologists consider such marked changes
improbable, and further investigation and confirmation of Schmidt’s study is
called for. Schmidt’s work has received severe criticism for technical flaws in
the research.

38. Do you think it correct to say that Danny’s intelligence changed? 

39. What explanation can you offer for the relative inaccuracy of prediction from 
preschool IQ’s? 

EMPIRICAL VALIDITY OF THE STANFORD-BINET 

The Binet scale samples a variety of behaviors that are obviously important,
and the score is a good predictor of future performance in the test itself.
For practical use of the test it is important to know how closely the
behaviors shown on it are related to external criteria. That is, how much are
life success, school success, and so on influenced by “intelligence as measured
by the Stanford-Binet test”?

School success is most easily and most frequently studied. In every case,
one reaches the same conclusion. The Binet IQ predicts achievement
sufficiently well to be useful, but many other factors influence achievement, so
that predictions must be tentative. The following correlations, for all
tenth-graders in one high school taking the subjects listed, are representative (6):

Form L IQ  with reading comprehension  .73
           with reading speed          .43
           with English usage          .59
           with history                .59
           with biology                .54
           with geometry               .48

Correlations of the same magnitude are found at the elementary level. 
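
One conventional way to read correlations of this size, sketched here as an illustration and not necessarily the method behind any figure in the text, is to compute the share of criterion variance accounted for, r², and the index of forecasting efficiency, E = 1 − √(1 − r²), which expresses how much the standard error of estimate shrinks relative to simply guessing the group mean. The dictionary and function names below are illustrative assumptions.

```python
import math

# Correlations of Form L IQ with tenth-grade achievement, as listed above (6)
CORRELATIONS = {
    "reading comprehension": 0.73,
    "reading speed": 0.43,
    "English usage": 0.59,
    "history": 0.59,
    "biology": 0.54,
    "geometry": 0.48,
}


def variance_explained(r):
    """Proportion of criterion variance associated with the predictor."""
    return r * r


def forecasting_efficiency(r):
    """Proportional reduction in the standard error of estimate."""
    return 1.0 - math.sqrt(1.0 - r * r)


if __name__ == "__main__":
    for subject, r in CORRELATIONS.items():
        print(f"{subject:22s} r = {r:.2f}   "
              f"r^2 = {variance_explained(r):.2f}   "
              f"E = {forecasting_efficiency(r):.2f}")
```

Even the highest coefficient in the list leaves a large share of the variation in achievement unaccounted for, which is consistent with the caution that predictions must be tentative.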

40. From Figure 7, what improvement over chance prediction can be obtained
using Binet IQ as a predictor of success in English or history?

41. The correlation of Binet IQ’s with grades of medical school seniors was found
to be only .15 in one study. The average IQ of these men was 131 (32). How
can you explain the lowness of the correlation?

A vast body of literature shows mental ability to have significant relations
to delinquency, vocational success, and so on. But it is recognized that
intelligence is only one facet of individuality to be considered in a practical
decision about a child or adult. One can neither predict behavior of a given
case knowing only his IQ, nor make an intelligent prediction without using a
good estimate of his mental ability.

DIAGNOSIS ON THE STANFORD-BINET 

Children with the same MA are of course far from alike in mental
development. This is shown by the fact that they pass quite different tests. The
Stanford-Binet, as a varied standardized situation, brings to light far more
individual differences than the single score represents. Experienced testers
always study such differences, and many have tried to develop supplementary
systems of scoring to report this information. In particular, many have
hoped that the scatter of performance would have diagnostic value. The
scatter is the range from the child’s earliest failure to his highest success; it
suggests whether all aspects of ability have developed evenly. After many
studies of scatter, investigators now agree that it has no value as a score. All
other attempts to obtain supplementary diagnostic scores from the
Stanford-Binet have similarly failed.
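
To make the definition of scatter concrete, the following Python sketch computes it for a hypothetical record. The record, the function name `scatter`, and the convention of counting whole year levels (and reporting 0 when no failure precedes the highest success) are assumptions made for illustration; the text does not specify how investigators scored scatter in detail.

```python
def scatter(results):
    """Return the scatter, in year levels, for one child's record.

    `results` maps a year level to a list of pass/fail flags (True = passed)
    for the tests given at that level. Scatter is taken here as the distance
    from the earliest level showing any failure to the highest level showing
    any success, and as 0 when no failure precedes the highest success.
    """
    failed_levels = [lvl for lvl, flags in results.items() if not all(flags)]
    passed_levels = [lvl for lvl, flags in results.items() if any(flags)]
    if not failed_levels or not passed_levels:
        return 0
    return max(0, max(passed_levels) - min(failed_levels))


# Hypothetical record: six tests at each of five year levels
EXAMPLE_RECORD = {
    6: [True, True, True, True, True, True],
    7: [True, True, True, False, True, True],
    8: [True, False, False, True, False, False],
    9: [False, True, False, False, False, False],
    10: [False, False, False, False, False, False],
}

if __name__ == "__main__":
    print("scatter in year levels:", scatter(EXAMPLE_RECORD))
```

In this made-up record the first failure comes at year 7 and the last success at year 9, so the scatter spans two year levels.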

The Binet test fails to yield diagnostic scores because it was designed to 
avoid permitting any factor save “general ability” to influence scores to a 
measurable degree. We cannot trace accurately the child’s development in 
simple recall, for example, because the digit-span tests and others like them
are not uniformly spaced at all levels of difficulty. Even within one year- 
scale, we cannot discuss the child’s strengths and weaknesses with confi- 
dence, because tests grouped together do not have exactly the same diffi- 
culty. Nevertheless, the clinician is deprived of valuable clues if he does not 
study the pattern of test performance in detail. 

One cannot measure a “verbal IQ” readily, but if a child has an unusual
handicap or facility in verbal tests, the examiner has an excellent opportunity
to note it. Deficiencies in information, arithmetic skill, and reasoning may
also be noted. These indications, even if brought to light only in one or two
subtests, provide profitable leads for further study and confirmation by
reliable tests of the separate abilities.

The Binet also affords an excellent opportunity to see how the child works. 
The observer should note, for example, when an impulsive child uses trial 
and error in an attempt to “force” a solution instead of using reasoning. An 
inhibited child may refuse to take a chance on items where induction or 
imagination is called for and he cannot be positive that his answer is right. 
A distinction should be made between the child who is successful because of 
coaching, who does well on such teachable items as counting to 13 or saying 
the days of the week in order, and the more genuinely intelligent child who 
can make up a coherent story about a picture and tell what day of the week
comes before Tuesday. There are great differences in the ways children fail,
and the ways they react to failure. Some fail by not responding, while others
give answers even to questions on which they are ignorant. Some do not
know when their answers are incorrect, while others are clearly dissatisfied 
and aware they have met trouble. 

The analysis resulting from a careful clinical study is illustrated by the 
tester’s comments regarding John Sanders (CA 12.8; IQ 109), a normal
adolescent (24, pp. 91-92):

“John showed a lively intellectual curiosity and was interested in a variety of
things, but within each of these interests his attention seemed to be rigid and
single-tracked. This lack of flexibility made it difficult for him to adapt to
requirements when on unfamiliar ground. Upon encountering difficulties, he frequently
demanded a pencil, because he could not ‘see’ the words or numbers; I have never
tested a more eye-minded person.

“John’s principal difficulties were on tests requiring precise operations, as in the
use of numbers. With such tests he became insecure and often seemed confused,
with slips of memory and errors in simple calculations. He asked to have
instructions repeated, was dependent on the examiner, and easily discouraged. Although
cooperative and anxious to do well, it was extremely hard for him to master a task
(such as ‘memory span’) in which he was required to be exact by fixed standards.
If this is also true outside the testing situation, it is not surprising that in his school
work he has found great difficulty in learning to spell, in mastering the mechanics
of English, and in learning a foreign language. We cannot tell from this test why
he has had such unusual difficulty in this kind of learning. However, the
supposition can be offered that in tasks involving an imaginative and analytic approach he
imposes form upon himself; in tasks of the type which he finds difficult, form is
imposed upon him from without. Resistance to such controls may account in part
for the discrepancies between John’s actual intelligence and his achievement in
certain fields.”

Clinicians find that the test directs attention to possible abnormal
deviations. Schizophrenics in particular give certain distinctive responses (21, p.
984). Compared to normals of the same mental age, schizophrenics are
superior in vocabulary, abstract words, and dissected sentences, but have much
greater difficulty with bead chains, picture absurdities, memory for designs,
and memory for stories (34). Because many normals also show these specific
difficulties, the pattern of passes and failures cannot be relied upon as an
absolute indicator of psychosis. As a preliminary indication to be checked
with further tests, the Binet is of some value in psychiatric diagnosis for
young subjects.

Still another technique of squeezing the test of its full contribution is to
watch for indications of attitudes and outlook on life. Strauss reports
remarks of this nature (39). To the question on the 10-year-level, “What 
ought you to do before undertaking something very important?” delinquent
mentally defective children answer, “Don’t touch anything that doesn’t be- 
long to you,” or “Run away from a guy who is going to take it. Go tell him 
nothing of the people that owns them.” In defining pity, one of them answers, 
“Don’t take pity on somebody, shoot them and kill them.” 

Essentially the Stanford-Binet is a standardized clinical observation. The
fact that it also yields an IQ should not blind the tester to his obligation to
report to those who will use the test everything he can observe. There is no
adequate rationale for making and interpreting these observations, and they
are necessarily tentative. But to avoid them because they are subjective is no
more sound than if the clinician were to refuse to have a conversation with
the child because it would not lead to a statistically manageable and reliable
score. The Binet tester with adequate experience has a great advantage over
the clinical interviewer because he can observe the child in a standardized
situation and can compare what he does with the behavior of other children.
The fact that the child does not think of a mental test as a situation revealing
emotions and habits of work is a further asset.

42. What sort of report should be placed in the school files for a child who has 
been given a Stanford-Binet test? 

EVALUATION OF THE STANFORD-BINET 

The wide acceptance the Stanford-Binet Scale has had should not obscure 
the fact that the test has been found imperfect by many critics. Criticism
attended Binet’s first published scale and each subsequent revision. Criticism
of even the 1937 revision ranges from minor suggestions regarding details to
Stoddard’s scornful statement, “The Iowa workers . . . feel that, over the 
years, the Stanford revisions have offered not very reliable measurements of 
functions not very close to intelligence” (38, p. 116). The wide acceptance 
the test has had reflects the fact that at each point in its development it was 
more serviceable than other available techniques. Its limitations are real, and 
better tests may supplant it some day. 

The scale is not the most convenient intelligence test, because it requires a
trained administrator and scorer, and much time per pupil. This lack of
economy is compensated for by greater precision and greater variety of tasks than
other tests offer, and by the wealth of observational data that can be
obtained during testing. The test appeals to children, and is in that respect easy
to give. The cumulated experience and research literature on the test make
results more usable than those of any new test could be, which will make it
difficult for even a better test to supplant the Stanford-Binet.

The test is accurate, except in those cases in which rapport is poorly
established. The norms and the comparability of parallel forms are not open to
serious question. The major controversies center around its validity, and it is
of course not equally valid for all purposes. It yields a faulty picture of
all-round ability for some children, but research has identified most of the
factors which make the IQ misleading. If a worker draws false conclusions
about the capacity of bilinguals, Negroes from an impoverished environment,
or children rejected by their parents, the error lies in the interpreter. The test
is not valid as a measure of capacity, unless a large number of assumptions
about past experience are satisfied. As a measure of ability, the test serves
very well. Some workers are critical because the test is heavily loaded with
verbal tasks and tasks influenced by educational experience. This is a
weakness for some applications, but for any attempt to predict the success of a
child in school it is an advantage.

Many workers would prefer to have a diagnostic test which yields dependable
scores representing the separate facets of mental development. They
consider the Binet a hodgepodge, which of course it is. The difference
between those who consider this an asset and those who wish to rectify it is in
miniature the problem of all psychological research. In the history of
psychology there has been a continual divergence between those who wish to
measure precisely whatever they measure, and to know just what they are
measuring even if it is of no practical importance, and those who wish to
solve a practical problem by treating it as a complex whole with indefinable
and inseparable elements. Binet’s success, in contrast to his predecessors in
mental testing, came because he found that in his time intelligence could
not be successfully subdivided. Whether the later development of psychology
has armed us with concepts and techniques which will make analytical
tests feasible is now under investigation. The tests discussed in subsequent
chapters are moving in that direction.

The age-scale method and the IQ, which have often been regarded as second
only to the concept of general intelligence itself as a contribution of this
test, are undergoing increasing criticism. The first limitation of the age scale
is its inefficiency. Tests are presented in a “spiral” order, so that one goes
through all types of test at one difficulty level, then mounts to another level
and uses similar tests. If all tests of a given type, such as digit span, were
given at once, testing time could be reduced. The pass-fail method of scoring
is inefficient because it throws away information. There is a difference
between the child who gets none of the verbal absurdities correct (p. 109) and
the child who solves two of them; but by the hurdle system of scoring the
child must pass three before the item adds to his score. Most critics of the
Stanford-Binet have argued for a point system of scoring which would
permit reliable measurement with fewer items. Numerous revisions using point
scales have been produced, in which one or more points are allowed for each
correct answer. Total point scores are converted into mental ages, so that the
same interpretations are made as from the age scale.
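
The contrast between hurdle scoring and point scoring can be sketched in a few lines of Python. The requirement of three correct answers is taken from the verbal-absurdities example above; the function names, the one-point-per-answer weighting, and the range of 0 to 4 correct answers are illustrative assumptions.

```python
def hurdle_credit(n_correct, required=3):
    """Pass-fail credit: 1 if the child reaches the hurdle, otherwise 0."""
    return 1 if n_correct >= required else 0


def point_credit(n_correct, points_each=1):
    """Point-scale credit: every correct answer adds to the score."""
    return n_correct * points_each


if __name__ == "__main__":
    print("correct  hurdle  points")
    for n in range(5):
        print(f"{n:7d}  {hurdle_credit(n):6d}  {point_credit(n):6d}")
```

Under the hurdle rule the child with no correct answers and the child with two receive the same credit, which is the information the pass-fail method throws away; the point rule keeps them apart.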

Hutt challenges the basic notion of the test as a standard procedure. He
contends that the patterns of success and failure of individuals cause
frustration at different points, which makes their motivation different. He proposes
instead to vary the test order with each subject by alternating easy and hard
items and thus reducing the cumulative impact of failure at the end of the
test. He hopes to obtain better diagnostic information. In an experimental
trial with comparable groups, he found that his “adaptive” method yielded
IQ’s similar to the normal test, for very well adjusted children. The mean
for very poorly adjusted children was 4.5 points higher with the adaptive
method (22).

43. How could you decide whether Hutt’s procedure is more valid than the
standard method?

44. What arguments can be advanced against Hutt’s proposal?

The attack on the IQ as a method of scoring has been based on many
weaknesses. The variation in range of IQ’s from year to year along the scale,
the lack of comparability with tests of other functions, and the artificiality
of the concept at adult ages have been pointed out. While the logic of the IQ
is defensible, its popularity has led to a rash of quotients (achievement
quotients, educational quotients, personality quotients, and so on), some of
which are meaningless and all of which present statistical difficulties. Many
workers who developed intelligence tests have devised synthetic IQ’s which
are not quotients at all and are called by that name only because test users
have been accustomed to expect an IQ from an intelligence test. Because
standard scores are not open to any of the above objections, and would
permit direct comparison of standing in intelligence with standing in other
abilities for which there are no quotients, most statisticians urge that standard
scores be used with future tests, the IQ being discarded. Terman argues that
although standard scores will do all the IQ does and are superior for
research, teachers, social workers, and physicians have learned, over the years
since 1916, what an IQ of a given size means (44, p. 27). This argument is a
little weakened by the fact that an IQ of a given size has a different meaning
on the 1937 test from what it had on the first test and a different meaning at
different age levels of the 1937 test. But it is true that if we shift to standard
scores it will take many years before the scoring is understood by as many
people as can interpret at least crudely an intelligence quotient.
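
A standard score locates a person relative to the mean and standard deviation of his own age group. As a sketch, the linear T-score form (mean 50, standard deviation 10, with no normalizing of the distribution assumed) can be computed as below. The age-12 mean and standard deviation of mental age are McNemar's figures quoted in the exercise that follows (27); the function names and the three sample mental ages are illustrative.

```python
def z_score(score, mean, sd):
    """Standard score: distance from the group mean in standard deviations."""
    return (score - mean) / sd


def t_score(score, mean, sd):
    """Linear T-score: standard score rescaled to mean 50 and SD 10."""
    return 50.0 + 10.0 * z_score(score, mean, sd)


# McNemar's values for 12-year-olds, in years of mental age (27)
AGE_12_MEAN_MA = 12.4
AGE_12_SD_MA = 2.3

if __name__ == "__main__":
    for ma in (10.1, 12.4, 14.7):
        t = t_score(ma, AGE_12_MEAN_MA, AGE_12_SD_MA)
        print(f"MA {ma:4.1f} at age 12  ->  T = {t:.0f}")
```

Because the score is expressed in units of the age group's own spread, the same arithmetic applies to any ability for which a mean and standard deviation are available, which is the comparability that a quotient does not offer.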

45. Prepare a table of T-scores (see p. 32) for the Stanford-Binet test at ages 8, 
12, and 15, for a representative group of school children, using McNemar’s 
data below (27, pp. 32-33). 

Age                     8      12      15
Mean MA                 8.2    12.4    14.7
Standard deviation      1.2     2.3     2.7

46. If, in 1916, Terman had weighed the comparative advantages of introducing 
T-scores or IQ’s, what arguments could have been advanced for the IQ? 

47. With standard scores, how would adult performance be reported? Must any
assumptions be made about the age when mental growth ceases?

48. If we had an accurate test, why could we not report color vision in terms of
“color-vision age” and “color-vision quotient”?

49. The MA compares the subject to the group who are equal with him in mental
development but not in age; the standard score compares him to those of his
own age, but with different development. With which group will he most often
be competing in life situations?

Another set of arguments has been set off by questions about the shape of
the mental growth curve. The difference between MA 15 and MA 16 is quite
small; people with either MA can do about the same things. The difference
of one year in MA is quite large at early ages, in terms of what a child can
do. Moreover, it is not accurate to assume that mental growth ceases utterly
near age 15. Heinis proposed a scaling system in which equal score
increments would represent equal amounts of mental growth. This in turn led to
a “personal constant,” or PC, which he would substitute for the IQ (19).
Such methods appear to offer no advantage over the standard score for
practical purposes.

Several revisions of the Binet other than Terman’s have been developed.
Characteristics of the most important ones are summarized in Table 19. 


Table 19. Representative Tests Derived from the Binet Scales

Test and Publisher     Author          Date  Age Range    Special Features          Comments of Reviewers*

Stanford Revision      Terman          1916  3 to adult   Introduced the IQ. Most
(Houghton Mifflin)                                        widely used, 1916 to
                                                          1937.

Stanford Revision,     Terman-Merrill  1937  2 to adult   Parallel forms. Extended  “Much superior to the old
Forms L and M                                             for higher and lower      [1916] for survey purposes.
(Houghton Mifflin)                                        ages. Most widely used,   . . . Many shortcomings for
                                                          1938 to 1949.             clinic use” (25).

Point Scale (Warwick   Yerkes-Bridges  1923  3 to adult   Subtests organized as
and York)                                                 point scale. Of mainly
                                                          historical interest.

Tests of Mental        Kuhlmann        1939  3 months to  Uses Heinis’ scaling and  “Rather intricate. . . . In
Development (Educ.                           adult        point scoring.            the hands of qualified
Test Bureau)                                                                        workers, highly reliable,
                                                                                    stable, and accurate” (20).

Detroit Tests of       Baker-Leland    1935  4 to adult   Tests grouped for         “standardization is
Learning Aptitude                                         diagnosis. Point scale.   inadequate. . . . The whole
(Pub. School Publ.                                                                  interpretation lacks
Co.)                                                                                statistical foundation”
                                                                                    (33, p. 129).

* The reader should refer to the source cited for a full statement of the reviewer’s opinion.


SUGGESTED READINGS 

Brown, Elinor W. “Observing behavior during the intelligence test,” in Lerner,
Eugene, and Murphy, Lois B. (eds.), “Methods for the study of personality
in young children.” Monogr. Soc. Res. Child Developm., 1941, 6, No. 4, pp.
268-283.

Doll, Edgar A. “Note on the age placement of year-scale tests.” J. consult. Psychol.,
1947, 11, 144-147.

Kent, Grace H. “Suggestions for the next revision of the Binet-Simon scale.”
Psychol. Rec., 1937, 1, 409-432.

Murphy, Lois B. “The appraisal of child personality.” J. consult. Psychol., 1948,
12, 16-19.

Poffenberger, A. T. “The role of intelligence in adjustment,” in Principles of
applied psychology. New York: Appleton-Century, 1942, pp. 282-304.







Stoddard, George D. “The background of mental testing” and “Translations and 
revisions of Binet tests,” in The meaning of intelligence. New York: Macmillan, 
1943, pp. 79-99, 100-126. 

Teagarden, Florence M. “The child’s intelligence,” in Child psychology for
professional workers. New York: Prentice-Hall, 1940, pp. 387-428.

Terman, Lewis M. “The vocational successes of intellectually gifted individuals.”
Occupations, 1942, 20, 493-498. 

REFERENCES 

1. Anderson, Edward E., and others. “Wilson College studies in psychology:
I. A comparison of the Wechsler-Bellevue, Revised Stanford-Binet, and
American Council on Education tests at the college level.” J. Psychol., 1942, 14,
317-326.

2. Anderson, John E. “Freedom and constraint or potentiality and environment.”
Psychol. Bull., 1944, 41, 1-29.

3. Beckham, A. S. “Minimal intelligence levels for several occupations.” Person.
J., 1930, 9, 309-313.

4. Bernreuter, Robert G., and Carr, E. J. “The interpretation of IQ’s and the
L-M Stanford-Binet.” J. educ. Psychol., 1938, 29, 312-314.

5. Blanton, Annie W. “The child of the Texas one-teacher school.” Univ. Texas
Bull., 1936, No. 3613.

6. Bond, Elden A. Tenth grade abilities and achievements. New York: Teachers
Coll., Columbia Univ., 1940.

7. Brown, Andrew W. “The change in intelligence quotients in behavior
problem children.” J. educ. Psychol., 1930, 21, 341-350.

8. Brown, Ralph R. “The time interval between test and retest in its relation to
the constancy of the intelligence quotient.” J. educ. Psychol., 1933, 24, 81-96.

9. Burr, Emily T. “Minimum intellectual levels of accomplishment in industry.”
J. person. Res., 1924, 3, 207-212.

10. Cobb, M. V. “The limits set to educational achievement by limited
intelligence.” J. educ. Psychol., 1922, 13, 449-464, 546-555.

11. Cox, Catharine M. Genetic studies of genius. II. The early mental traits of
three hundred geniuses. Stanford Univ.: Stanford Univ. Press, 1926.

12. Darcy, Natalie T. “The effect of bilingualism upon the measurement of the
intelligence of children of preschool age.” J. educ. Psychol., 1946, 37, 21-44.

13. de Latour, Patricia M. The borderline child. Unpublished Master’s thesis,
Univ. of Otago, Dunedin, N.Z., 1944.

14. Finch, F. H. “Enrollment increases and changes in the mental level of the
high school population.” Appl. Psychol. Monogr., 1946, No. 10.

15. Freeman, Frank N. Mental tests. Boston: Houghton Mifflin, rev. ed., 1939.

16. Gates, Arthur I., and others. Educational psychology. New York: Macmillan,
1942.

17. Havighurst, R. J., and others. “Environment and the Draw-a-Man Test: the
performance of Indian children.” J. abnorm. soc. Psychol., 1946, 41, 50-63.

18. Havighurst, Robert J., and Janke, Leota L. “Relations between ability and
social status in a midwestern community. I: Ten-year-old children.” J. educ.
Psychol., 1944, 35, 357-368.

19. Heinis, H. A. “Personal constant.” J. educ. Psychol., 1926, 17, 163-186.

20. Hildreth, Gertrude. “The Kuhlmann Tests of Mental Development.” J. consult.
Psychol., 1939, 3, 128-130.

21. Hunt, J. McV. (ed.). Personality and the behavior disorders. New York:
Ronald Press, 1944.




22. Hutt, Max L. “A clinical study of ‘consecutive’ and ‘adaptive’ testing with the
Revised Stanford-Binet.” J. consult. Psychol., 1947, 11, 93-103.

23. Johnson, George R. “High school survey.” Publ. Sch. Messenger, 1937, 35,
No. 4, 1-34.

24. Jones, Harold E., and others. Development in adolescence. New York:
Appleton-Century, copyright, 1943.

25. Krugman, Morris. “Some impressions of the Revised Stanford-Binet Scale.”
J. educ. Psychol., 1939, 30, 594-603.

26. Lowell, Frances E. “A study of the variability of IQ’s in retest.” J. appl.
Psychol., 1941, 25, 341-356.

27. McNemar, Quinn. The revision of the Stanford-Binet scale. Boston: Houghton
Mifflin, 1942.

28. Maurer, Katherine M. Intellectual status at maturity, as a criterion for
selecting items in preschool tests. Minneapolis: Univ. of Minnesota Press, 1946.

29. Merrill, Maud A. “On the relation of intelligence to achievement in the case
of mentally retarded children.” Comp. Psychol. Monogr., 1924, 2, No. 10.

30. Merrill, Maud A. “Significance of IQ’s on the revised Stanford-Binet scales.”
J. educ. Psychol., 1938, 29, 641-651.

31. Mitchell, Claude. “Prognostic value of intelligence tests.” J. educ. Res., 1935,
28, 577-581.

32. Mitchell, Mildred B. “The revised Stanford-Binet for university students.” J.
educ. Res., 1943, 36, 507-511.

33. Mursell, James L. Psychological testing. New York: Longmans, 1947.

34. Myers, C. Roger, and Gifford, Elizabeth V. “Measuring abnormal pattern on
the Revised Stanford-Binet Scale (Form L).” J. ment. Sci., 1943, 89, 92-101.

35. Peterson, Joseph. Early conceptions and tests of intelligence. Yonkers: World
Book Co., 1925.

36. Pintner, Rudolf, and others. “Supplementary guide for the Revised
Stanford-Binet Scale (Form L).” Appl. Psychol. Monogr., 1944, No. 3.

37. Schmidt, Bernardine. “Changes in the personal, social, and intellectual
behavior of children originally classified as feeble-minded.” Psychol. Monogr., 1946,
60, No. 281.

38. Stoddard, George D. The meaning of intelligence. New York: Macmillan,
1943.

39. Strauss, Alfred A. “Enriching the interpretation of the Stanford-Binet test.”
J. except. Child., 1941, 7, 260-264.

40. Terman, Lewis M. “The Binet-Simon scale for measuring intelligence:
impressions gained by its application.” Psychol. Clin., 1911, 5, 199-206.

41. Terman, Lewis M. The measurement of intelligence. Boston: Houghton Mifflin,
1916.

42. Terman, Lewis M. The intelligence of school children. Boston: Houghton
Mifflin, 1919.

43. Terman, Lewis M. Genetic studies of genius. IV. The gifted child grows up.
Stanford Univ.: Stanford Univ. Press, 1947.

44. Terman, Lewis M., and Merrill, Maud A. Measuring intelligence. Boston:
Houghton Mifflin, 1937.

45. Terman, Lewis M., and others. The Stanford revision and extension of the
Binet-Simon scale for measuring intelligence. Baltimore: Warwick and York,
1917.

46. Thorndike, Robert L. “The effect of the interval between test and retest on the
constancy of the IQ.” J. educ. Psychol., 1933, 24, 543-549.

47. Traxler, Arthur. “What is a satisfactory IQ for admission to college?” Sch. &
Soc., 1940, 51, 462-464.




48. Wissler, Clark. The correlation of mental and physical tests. New York:
Columbia Univ., 1901.

49. Yoakum, Clarence S., and Yerkes, Robert M. Army mental tests. New York: 
Holt, 1920. 



CHAPTER 7 


Mental Diagnosis: The Wechsler Test 


Whereas the Binet test was designed to yield a single measure of general
mental ability, clinical users have long desired a test which would reveal
more about the pattern of an individual’s mental functioning. If it is possible
to obtain reliable tests which will describe the person’s particular strengths
and weaknesses, as well as estimate his general level of development, such
information will be of major value in guidance and in disposition of mental
patients. The test which, at the present time, comes closest to an adequate
diagnosis of mental abilities is the Wechsler-Bellevue Intelligence Scale. In
this chapter we shall consider its characteristics, first as a general mental test,
and then as a diagnostic instrument. This will provide an opportunity to
consider the problems inherent in diagnostic testing in general.

CHARACTERISTICS OF THE WECHSLER TEST 
Development of the Instrument 

While the Binet scale contributed largely to the development of insight
into mental abilities, deficiencies were found in the practical utility of the
Stanford and other revisions. Those who used the test with adults were
particularly troubled by the fact that they were planned for children. Wechsler’s
criticisms which led him to develop the Bellevue scale are worth quoting at
length (17, pp. 15-18):

“The scales now in use fail to meet some of the most elementary requirements
which psychologists ordinarily set themselves when standardizing a test. The first
deficiency is that they have not been standardized on a sufficient number of cases.
Indeed most of them were never standardized on any adults at all. . . .

“A more obvious shortcoming of the intelligence examinations now used for
adults, and one that has often been noted, is the unsuitability of much of the
material that forms part of the examinations. Many of the test items do not seem to
be of the sort that would either interest or appeal to an adult. . . . To ask the
average adult to say as many words as he can think of in three minutes, or to make
a sentence of the words to asked paper my teacher correct I my and assume that
he will be either interested or impressed, is expecting too much. The average child
responds to such questions as a matter of course. The average adult will, as often
as not, start wondering as to what possible purpose or meaning the tests can have.
Such remarks as ‘That’s baby stuff,’ ‘Why do I have to do this?’ and ‘I never had
that in school,’ are very common, particularly among less alert subjects. . . .

“Apart from the matter of interest and appeal, there are other serious objections
to the type of material generally employed in children’s tests which make them
unsuitable for adult use. One of them is the fact that credit for correctness of
response so often depends upon the individual’s capacity to manipulate words (or
objects), rather than upon comprehension of their meaning. . . .

“Another limitation to the use of tests on adults, originally standardized on
children, is that many of these tests lay altogether too much emphasis on speed as
compared to accuracy. . . . This is particularly true for older subjects who do
badly on nearly all ‘speed’ tests. A number of explanations have been offered for
this fact. The one which is of particular pertinence to our discussion is the possible
influence of the different attitude which adults take toward set tasks of test
situations. When you tell a child, ‘Put these blocks together as fast as you can,’ the
chances are he will accept the instructions at their face value. One cannot be so
sure of a similar acceptance in the case of the adult. He might be a type of
individual characterizing the attitude, ‘look before you leap’ or ‘try to figure this
thing out.’ In that case his attitude might only serve to get him a lower intelligence
rating.”

1. Relate Wechsler’s criticisms to the characteristics of a good test given in
Chapter 4. To what extent do his criticisms refer to practical considerations, and to
what extent to validity, reliability, and norms?

2. With reference to Table 11, identify several Stanford-Binet items open to one
or more of Wechsler’s criticisms.

3. To what extent do Wechsler’s criticisms suggest that the Binet is a poor test for
use with children?

Because of these difficulties, Wechsler determined to build a test appropriate
to his needs. Just as Binet’s interest in the failing school child helped to
define his test, Wechsler’s position as clinical psychologist in New York’s
Bellevue Hospital influenced his goals. Wechsler’s duties included examination
of criminals and patients who might be feeble-minded, psychotic, or
illiterate. Whether those subjects were genuinely subnormal in intelligence
was of major importance in deciding disposition. For Wechsler, as for Binet,
the principal interest was in studying the average and below-average group;
no particular effort was made to measure precisely the higher levels of adult
mental ability.

It became particularly important for Wechsler, studying adults through
very advanced ages, to establish a new approach to adult “normality.” While
adults do not grow in mental power through adulthood, it is not true that the
average adult has at all ages the same mental power he had at 16. Wechsler’s
data from his test are presented in Figure 25. The average performance of
adults begins to drop in the early 20’s. If performance at age 16 were taken
as normal, the average adult at age 50 would have an IQ of about 75, according
to these data, since his performance is equal to that of a 12-year-old. This
is clearly an unsatisfactory way of expressing deviation from the average.
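
The arithmetic behind the figure of 75 can be sketched as follows. This is a
minimal illustration, assuming the conventional ratio IQ (mental age divided by
chronological age, times 100) with the chronological age of adults held at 16,
as the passage implies. The function name ratio_iq and the parameter
adult_ceiling are hypothetical and are not taken from Wechsler.

```python
def ratio_iq(mental_age, chronological_age, adult_ceiling=16):
    """Ratio IQ = 100 * MA / CA, with CA capped at an assumed adult ceiling."""
    # Assumption: every adult is judged against the 16-year-old standard.
    effective_ca = min(chronological_age, adult_ceiling)
    return 100.0 * mental_age / effective_ca


# An average 50-year-old who performs like a 12-year-old, judged against age 16.
print(ratio_iq(12, 50))  # 75.0
```

Under this convention every adult is compared with the 16-year-old standard, so
the normal decline with age appears as a deficit in the individual.
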

Wechsler set about assembling appropriate test materials for his purposes.
He is not very explicit about his theory of intelligence or the criteria used to
select items. He subscribes to Binet’s idea of a general or “global” ability, but
he also recognizes that, in patients particularly, ability functions along some
lines better than along others. Wechsler adapted from earlier tests a variety
of items which he felt had been useful in understanding some of his patients.
He was seeking items which, while falling within the area we identify as
general intelligence, had sufficiently unique characteristics to silhouette different
types of thinking or performance.

After considerable clinical trial, the first edition was published in 1939.
Early in the war, the Army requested Wechsler to make a parallel form for
its use. Form B was used extensively in Army hospitals, and considerable
research by Army psychologists has been published. In 1946 Wechsler
published Form II of the test, which was substantially the same as Form B except
for the discarding of experimental materials. The Wechsler test has received
widespread use by clinicians, primarily with adult subjects. It is increasingly
being applied to adolescents and is now being extended to make it appropriate
for younger children.



Fig. 25. Changes in mental ability with age (17, p. 29). [Horizontal axis: age,
5 to 65.]


Description 

A detailed description of Form I, which has been most widely used, is the
best means of introducing the test. Form II is the same in form but has
different items. The Bellevue test is a point scale, in contrast to the age scale
used by Binet. Each item is credited a certain number of points, the points
earned being added to determine the score. The total number of points
earned is converted, by norm tables, into a Wechsler IQ. Wechsler’s test is
divided into 11 subtests, each containing one type of item. This makes it
possible to determine a separate diagnostic score on each type of behavior.
Wechsler groups the subtests in two series, one of which yields a “verbal IQ,”
the other a “performance IQ.” A total IQ is also used.

The Verbal scale includes tests of Information, Comprehension, Digit
Span, Similarities, Arithmetic, and Vocabulary. Vocabulary was originally
considered an alternate test but proved so helpful that it is now generally
administered as a regular part of the scale. The Performance scale includes
Picture Arrangement, Picture Completion, Block Design, Object Assembly,
and Digit Symbol. A brief description of each subtest follows.

Information includes such items as “What is the population of the United
States?” “What does rubber come from?” and “How many weeks are there in
a year?”


Fig. 26. Representative materials from Wechsler performance tests. [Panels
recoverable from the scan: A. A Picture Arrangement item. C. An Object Assembly
item.]

Comprehension questions include “Why should we keep away from bad
company?” and “Why are shoes made of leather?” The subject is expected to
give a generalized, fairly direct answer.

In Digit Span, the subject is asked to repeat after the examiner three- to
nine-digit numbers. These are given in hurdle fashion, and the test is stopped
after two failures at any level of difficulty. The same type of problem is then
repeated, with the subject required to give the numbers backward.

Similarities asks the subject to tell how the following are alike: orange and 
banana, air and water, poem and statue, etc. 

Arithmetic is a test of numerical reasoning ability using simple verbal
problems, such as “How many oranges can you buy for thirty-six cents if one
orange costs four cents?” The subject is required to do the items mentally and
receives no credit on an item where he uses more than a reasonable time.

Vocabulary requires the subject to define or explain such words as
“diamond,” “seclude,” “pewter,” and “chattel.”

Materials for the Performance scale are illustrated in Figure 26. The
pictures at the top are to be rearranged by the subject to make a meaningful
story. Items of this type constitute the Picture Arrangement test.

In Picture Completion, the subject locates the missing part of a drawing of
a familiar object.

Block Design requires the subject to produce such a complex design as
(b) in Figure 26 by fitting together separate cubes, painted differently on the
six faces.

Object Assembly makes use of three cutout objects, a manikin, hand, and
face (c, Fig. 26). The subject must put them together, one at a time.

Digit Symbol is a timed test in which the subject writes beneath each 
number an arbitrarily assigned symbol. 

4. Which subtests have the following characteristics?

a. The score is affected by educational background.

b. The test demands experiences found in the urban American culture.

c. The test requires problem solving or reorganization of knowledge rather than
mere recall.

5. How do the Wechsler test items differ from the higher levels of the
Stanford-Binet?

After scoring responses according to Wechsler’s guide, the examiner
obtains the total score for each subtest. These are then translated into weighted
scores to allow for the relative lengths of the tests. Since the weighted scores
are standard scores, the mean always being 10, we can study the profile or
pattern of the subject’s performance. This is the diagnostic feature of the
test.
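
The idea of putting subtests of different lengths on one scale can be sketched
as below. This is only an illustration, assuming the weighted scores behave
like linear standard scores with mean 10 and standard deviation 3. The standard
deviation, the function name weighted_score, and the norm means and standard
deviations are hypothetical values chosen for the example. In practice the
examiner reads weighted scores from Wechsler's published table.

```python
def weighted_score(raw, norm_mean, norm_sd, target_mean=10.0, target_sd=3.0):
    """Convert a raw subtest score to a standard score on a common scale."""
    z = (raw - norm_mean) / norm_sd          # distance from the norm mean in SD units
    return round(target_mean + target_sd * z)


# Hypothetical (raw score, norm mean, norm SD) for three subtests.
raw_results = {
    "Information": (15, 12.0, 4.0),
    "Digit Span": (9, 11.0, 2.0),
    "Block Design": (20, 20.0, 6.0),
}

profile = {name: weighted_score(*values) for name, values in raw_results.items()}
print(profile)  # {'Information': 12, 'Digit Span': 7, 'Block Design': 10}
```

Because every subtest is expressed on the same scale, a score of 12 on one
subtest and 7 on another can be compared directly as a peak and a valley.
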

The weighted scores for verbal tests and for performance tests may be
combined and converted into the IQ’s by reference to a table using the
person’s chronological age. Wechsler made no attempt to define mental age,
since that concept is meaningless for adults. Instead, he defined the IQ in
percentile terms. The median IQ was defined at 100, a value close to the 25th
percentile was taken as IQ 90, and so on. Separate norm calculations were
made for adults in different age ranges. A person who earns a total score of
70 points has a Bellevue IQ of 79 if he is 16 years old, 89 if he is 35 years
old, 93 if he is 45, and 97 if he is 55. Tables are available for ages 10 to 59.
IQ’s may also be computed for more advanced ages.
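
A percentile definition of the IQ can be sketched as below. This is a minimal
illustration, assuming that scores within an age group are normally distributed
and that the IQ scale has a standard deviation of about 15. Neither assumption
is stated in the passage, but a standard deviation near 15 is what places the
25th percentile at about IQ 90. The names ASSUMED_SD and iq_from_percentile are
hypothetical. The small lookup table simply restates the four values quoted
above for a total score of 70 points.

```python
from statistics import NormalDist

ASSUMED_SD = 15.0  # assumption: not given in the text


def iq_from_percentile(percentile):
    """IQ corresponding to a percentile rank within the subject's own age group."""
    z = NormalDist().inv_cdf(percentile / 100.0)
    return 100.0 + ASSUMED_SD * z


print(round(iq_from_percentile(50)))  # 100
print(round(iq_from_percentile(25)))  # 90

# Values quoted in the text: Bellevue IQ for a total score of 70 points, by age.
IQ_FOR_70_POINTS = {16: 79, 35: 89, 45: 93, 55: 97}
for age, iq in IQ_FOR_70_POINTS.items():
    print(f"age {age}: IQ {iq}")
```

The same 70 points earn a higher IQ at 55 than at 16 because the older subject
is compared only with persons of his own age.
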

Even more than the Binet test, the Wechsler provides opportunity for a
standardized observation of behavior. The subject’s personality
characteristics are evident in his attack upon the performance items in particular. The
Comprehension and Similarities tests of the verbal section are also
informative, since the person’s thinking habits are demonstrated. Projective content
may be elicited by some of the questions. The boy who explains the similarity
of praise and punishment with “They both come together. Whenever you get
praised for something, then you get punished right afterward” reveals much
about his morale and his relations to his parents. Such clues cannot be used
except as a guide to further study and attempted confirmation.

The Wechsler test is comparatively simple to administer. The full test
requires from 45 minutes to an hour. The directions are less complex than in
the Binet, and the system of grouping similar tests reduces the task of the
examiner. The ability of the examiner to establish rapport and his carefulness
in following directions may influence the score greatly, so that skilled testing
is required. In some of the verbal tests, the examiner must make rather
sensitive judgments as to the correctness of an answer since it may be necessary
to request the subject to elaborate his meaning. Answers that seem wrong
may be correct when the subject explains himself. Subjectivity in scoring
borderline answers is also a potential problem.

6. A 16-year-old earns a Stanford-Binet IQ of 79. What IQ’s would be given by
Terman to persons 35, 45, and 55 who passed the same subtests as he does?

7. Several experimental tests in the Army Wechsler have not been generally
published. One was a maze test. What data would one obtain to decide whether the
maze test should be made a permanent part of the Performance scale?

8. Wechsler standardized his test on a sample entirely composed of white persons. 
How would one interpret an IQ of 100 obtained by a Negro? 

9. If you were working in New York City, attempting to establish norms which 
would be comparable to the performance of Americans in general, what factors 
would you have to consider in selecting cases? How could you proceed? 

Reliability and Validity of the IQ 

As one of the outcomes of the Wechsler test is an all-over IQ, it may be
judged in the same manner as the Stanford-Binet. Reliability and validity
coefficients have been reported from several studies, although the research is
less complete than for the older test. The coefficient of equivalence, split-half,
for 355 young adults, was .90 in one study. The stability of the IQ over
intervals less than one year is indicated by two retest coefficients of .94
obtained for small groups (17). These studies suggest that error of measurement
is usually small and that the Wechsler is as accurate as the Stanford-Binet.
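
What coefficients of this size imply for a single IQ can be sketched with the
usual standard error of measurement, the standard deviation times the square
root of one minus the reliability. This is a minimal illustration, assuming an
IQ standard deviation of 15. That value is an assumption for the example and is
not reported in the passage, and the function name is hypothetical.

```python
import math

ASSUMED_IQ_SD = 15.0  # assumption: not reported in the text


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r): typical spread of obtained scores around a true score."""
    return sd * math.sqrt(1.0 - reliability)


for r in (0.90, 0.94):
    sem = standard_error_of_measurement(ASSUMED_IQ_SD, r)
    print(f"reliability {r:.2f}: SEM about {sem:.1f} IQ points")
```

On these assumptions, an obtained IQ would typically lie within about four or
five points of the value the subject would average over many testings.
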

Correlation with other tests is a favorite method of “validating” mental
tests. The Stanford-Binet IQ and the Wechsler IQ correlate very highly.
Studies reviewed by Rabin (11) and Watson (16) find correlations from .80
to .90 when the range in ability is not limited by pre-selection. Apparently
the Wechsler correlates with Binet almost as well as two administrations of
the same test correlate. But the fact that Wechsler IQ’s are figured in a
different manner means that a subject’s Wechsler IQ will be quite different from
his Binet IQ, even though his rank in his age group will be fairly similar on
the two tests. An important finding is that the Verbal IQ usually correlates
with the Binet and other tests far higher than the Performance IQ. For
normal adolescents, the correlations with Binet IQ were: Full Scale, .86; Verbal,
.80; Performance, .67 (5).

Few studies have related the Wechsler IQ to non-test criteria. With one
group of college freshmen, the Wechsler Verbal IQ correlated .52 with
marks, about the same as the Binet or the American Council group test for
the same cases (1).

10. What do the high correlations between Wechsler and Stanford-Binet IQ’s
suggest regarding the importance of not using childish items with adults?

DIAGNOSTIC USE OF THE WECHSLER TEST 

It was explained earlier that a profile could be obtained for the 10 or 11
subtests of the Wechsler. If each subtest calls for a different mental ability,
an interpretation of the profile should be very informative. There are two
basically different systems of interpreting the profile. One is the “signs”
approach, which requires no assumptions about the logical validity of the
subtests. The psychologist merely looks in the profile for peaks and valleys
known to be common in certain types of cases. The other system is an
interpretation in terms of the mental functions the tests are thought to measure.

Empirical Interpretation of Profiles 

Most scientific progress begins with empirical observations. In clinical
psychology somebody notices that mental patients behave differently in some
respect, and gradually, as a result of much analysis, a theory to explain the
difference evolves. When the results are merely empirical, they are little
understood, but they are still useful. For instance, we know that some
patients respond to brain surgery for the relief of psychosis but that others
are not helped. A psychologist might try the Wechsler test with patients
before operation (no one has) and find that patients who will recover do
especially well on the Block Design test. This could not be immediately
explained, but it could be put into use at once as a “sign” to help doctors
decide whether to advise an operation. This is the way the diagnostic use of
the Wechsler test has been built up, sign by sign.

Numerous studies have been made comparing the subtest scores of
patients who have been classified by psychiatric examination. The full catalog
of signs is far too long to describe here, but a few examples can be given.
Verbal IQ is generally higher than Performance IQ for organic brain
damage, schizophrenia, and neurosis, but lower for adolescent psychopaths and
the feeble-minded (17). Given the problem, then, of a young criminal with a
low IQ, we have a sign which helps us to decide whether he is feeble-minded
or normally capable but performing badly because of emotional blocks. No
final decision is made from one sign. For schizophrenia, Wechsler lists the
following signs: Verbal higher than Performance, relatively high Information
and Vocabulary, average to superior Digit Span and Block Design, average
to poor Arithmetic, and low Object Assembly and Digit Symbol (17).
Sometimes the signs are combined in complex formulas, as with Rabin’s
“schizophrenic index,” the ratio of Information plus Comprehension plus Block
Design to Digit Symbol plus Object Assembly plus Similarities (10). No one
proposes that the Wechsler be substituted for other methods of clinical
diagnosis; it merely sets up hypotheses which aid the clinician in thinking
about his various data.
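
The ratio described for Rabin's index can be written out as below. This is a
minimal sketch of the ratio as the passage words it. The function name and the
weighted scores in the example are hypothetical, and no cutoff value is assumed
because the passage gives none.

```python
NUMERATOR = ("Information", "Comprehension", "Block Design")
DENOMINATOR = ("Digit Symbol", "Object Assembly", "Similarities")


def rabin_index(weighted_scores):
    """Ratio of (Information + Comprehension + Block Design)
    to (Digit Symbol + Object Assembly + Similarities)."""
    top = sum(weighted_scores[name] for name in NUMERATOR)
    bottom = sum(weighted_scores[name] for name in DENOMINATOR)
    return top / bottom


# Hypothetical weighted scores for one subject.
subject = {
    "Information": 13, "Comprehension": 12, "Block Design": 11,
    "Digit Symbol": 7, "Object Assembly": 8, "Similarities": 9,
}
print(rabin_index(subject))  # 1.5
```

A ratio above 1 means only that the first group of subtests stands higher than
the second. As the text stresses, such a sign sets up a hypothesis and does not
settle a diagnosis.
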

One feature of importance is the applicability of the Wechsler test to
deterioration with age. Scores of older subjects decline especially on certain
tests. This has been variously explained in terms of loss of mental speed, lack
of continued practice on intellectual puzzles, and other factors. By including
both tests which hold up with advancing age and tests which decline, the
test makes it possible to estimate both an older subject’s present level and
the level of ability he had before deterioration set in.

Functional Interpretation of Profiles 

In contrast to the strictly empirical approach, the functional approach
requires the interpreter to know what mental functions each test requires, so
that he can say what the peaks and valleys in the profile mean. This
interpretation may be based on the profile alone, but far more information is obtained
from a study of specific answers and of work habits in the performance tests.
In the verbal tests, there are many opportunities to spot anxiety, ideas of
reference, incoherence, pretentious verbosity, and other deviations from
normal. In the performance tests, one watches for unplanned trial-and-error,
sticky perseveration, blocking, and lack of critical evaluation. Rapaport
illustrates “within test differences” (12, p. 40): “If a subject knows how many
pints there are in a quart, but does not know what the Koran is, this will give
us merely an idea of his range of information. But if he knows what the
Koran is and asserts that a quart has four pints, we must consider the
presence of a temporary inefficiency; and if he insists that the Capital of Italy is
Constantinople, or that the Vatican is a robe, psychotic maladjustment will
have to be considered.”

What the tests mean is not reducible to such simple formulas as “low
Vocabulary means poor ability to deal with symbols.” Vocabulary is affected
by the extent of the subject’s past exposure to our verbal culture, including
schooling, and his ability to express himself, which may be handicapped by
emotional disturbance. It differs from Similarities in that Vocabulary can be
passed on the basis of routine memory, whereas Similarities requires the
subject to reorganize his information. This may explain the empirical finding
that Vocabulary is higher than Similarities for cases with organic brain
damage (13). All the elements that influence a particular test can be known only
through great experience with the test. The most careful simple analysis is
that of Rapaport and his co-workers, who group and characterize the tests as
follows (12):

Verbal:

Vocabulary — dependent on early education, resistant to deterioration.

Information — indicates early environment, education; can be blocked by
temporary anxieties.

Similarities — concept formation.

Comprehension — requires judgment in reality situations, delay of impulses;
reflects emotional states.

Attention and Concentration:

Digit Span — mainly a test of attention.

Arithmetic — mainly a test of concentration.

Visual Organization:

Picture Arrangement — calls for planning, anticipation.

Picture Completion — concentration, appraisal of relationships.

Visual-Motor Coordination:

Object Assembly — recognition of patterns, guidance of motor action.

Block Design — differentiating a pattern into parts, reproducing it.

Digit Symbol — psychomotor speed.

Another interesting classification, based on normal rather than clinical
subjects, is schematically represented in Figure 27. The linguistic, clerical, and
spatial groups show correlations with aptitude tests designed to measure
these respective abilities, but the correlations are too low to permit using the
Wechsler subtests in vocational guidance.

Fig. 27. Proposed classification of Wechsler subtests (after 4).


It is difficult to describe just what any subtest means. Even within the same 
subtest, questions differ psychologically. In Comprehension, for example, 
some questions reflect knowledge of what to do in real situations and others 
require ability to formulate rather abstract reasons for marriage laws or 
taxes. Factor analyses show that the tests have different relations in different 
age groups, and must presumably be interpreted differently at these lev- 
els (2). 



Case 1 (17, p. 165)

[Profile of weighted scores, plotted on a scale from 5 to 15, for Comprehension,
Arithmetic, Information, Digit Span, Similarities, Picture Arrangement, Picture
Completion, Block Design, Object Assembly, and Digit Symbol.]

Verbal IQ 120; Performance IQ 116

Analysis by Wechsler: “Male, age 25; college graduate, ‘Y’ secretary. Neurotic
patterning: Verbal higher than Performance, high Comprehension and Information
with relatively low Digit Span. Low Picture Arrangement and low Digit Symbol,
Object Assembly less than Block Design. Negative sign: relatively good
Arithmetic, but in this connection account must be taken of fact that subject is
a college graduate with a B.S. degree.” (D. Wechsler, Measurement of Adult
Intelligence, Williams & Wilkins Company.)

Case 2 (7, p. 121)

[Profile of weighted scores, plotted on a scale from 5 to 15, for Comprehension,
Arithmetic, Information, Digit Span, Similarities, Vocabulary, Picture
Arrangement, Picture Completion, Block Design, Object Assembly, and Digit
Symbol.]

Verbal IQ 117; Performance IQ 112

Analysis by Knight et al.: “Male, age 26, engineering graduate who repeatedly
attempted suicide. In the bright normal intelligence group. The scatter is very
considerable, ranging over eight points; and while Information, Picture
Completion, and Block Design are quite good, planning ability and judgment are
quite impaired and psychomotor speed is poor. The patient’s general mental
efficiency is poor, with extremely poor learning efficiency. Comparison of the
Information with the Comprehension score shows the impairment of judgment;
comparison of Picture Completion with Picture Arrangement shows the impairment
of planning ability. It is the extremely low Digit Symbol score which makes the
scatter a psychotic one. This impairment of psychomotor speed must be
considered as probably indicating depressive trends.” (Bull. Menninger Clinic,
1943, 7, 114-128.)

Case 3 (14, p. 617)

[Profile of weighted subtest scores.]

Verbal IQ about 112; Performance IQ about 116

Analysis by Schafer: [Sex and age not reported.] “This case is a conversion
hysteric. If we look first at the scatter, its outstanding aspects are the
impairment of Information and Digit Span; Comprehension, i.e., judgment, is well
retained. Digit Span is a test of attention. As a test of attention, Digit Span
is especially vulnerable to anxiety. Hence, our first inference is that intense
anxiety is present; there is a 6 unit difference between the Digit Span and
Vocabulary scores. In regard to the low Information score, our understanding is
this: at the core of hysterical neurosis is a pathologically excessive use of
repression as a mechanism of defense. Effects of repression become widespread,
when knowledge, information ... all may become dangerous to the hysterical
adjustment and, therefore, have to be avoided; otherwise, these thoughts might,
in some more or less distant way, touch upon the repressed ideas and thereby
mobilize further anxiety. As a result of the widespread repression, the function
underlying acquisition of information suffers, and the Information score becomes
low. Analysis of verbalization shows that, in regard to the fire in the movie
house, the subject would yell ‘fire’ and run to the exit. The contrast of this
impulsive response and the high Comprehension score indicates the
characteristically hysteric-like affective lability. All these indications
establish the diagnosis of hysteria.” (By permission of the New York Academy of
Sciences.)

Fig. 28. Bellevue test patterns and their interpretations.

Results of Diagnosis

Figure 28 presents three diagnoses resulting from the Wechsler-Bellevue
test. Even more elaborate discussions than these can often be given when
the clinician lists all the tentative hypotheses suggested by minor indications
in the test.

Attention is called to the great similarity of Cases 1 and 2, so far as test
scores are concerned. Yet one is diagnosed psychotic and the other neurotic.
Probably each of the single differences between the cases (e.g., in
Comprehension or Digit Symbol) is too small to be reliable. In each case the clinician
reached the conclusion he did on the basis of the total pattern of scores, from
study of detailed responses within the test, and perhaps from other
knowledge about the case. Because it makes such demands for refined
discrimination between very similar profiles, the Wechsler is not an easy tool to use for
diagnosis. This is especially true because the profile is unstable owing to
error of measurement. The test tempts the psychologist to make more
definite diagnoses than its validity warrants.

11. Bill and John, two 15-year-olds, are referred to the school psychologist
because both are failing in ninth-grade work, their courses being social studies,
English, general science, and art appreciation. Both have IQ’s of 93, but Bill
has a Verbal IQ of 95 and a Performance IQ of 92, while John has a Verbal
IQ of 87 and a Performance IQ of 106. How would the interpretations and
suggestions for dealing with the two boys differ?

12. What limitations are imposed on the Bellevue test as a result of the attempt
to make it a diagnostic measure?

13. What advantage would there be in using an “intelligence” test to diagnose
abnormal personalities, over using a “personality” test having similar validity
for that purpose? Would this argument hold if the “personality” test had
definitely higher validity for diagnosing such cases?

14. What is the size of the usual difference between weighted scores Wechsler 
calls high and low, in Case 1, Figure 28? 

15. Clinical psychologists frequently distinguish between a patient’s “intellectual 
equipment” and his “functioning ability” (using various words for these con- 
cepts). Equipment is thought of, not as congenital capacity, but as the maxi- 
mum intellectual power the person could summon up at this time. The equip- 
ment often does not function at its maximum, however, because of impulsive- 
ness, inhibition due to anxiety, autistic thinking, and other limitations. In terms 
of this point of view, people fall below their true potential to varying degrees. 

a. In terms of these concepts, what does the Wechsler test reveal? 

b. In terms of these concepts, what does the Binet test reveal? 

c. Which of these concepts comes closest to “intelligence”?

PROBLEMS OF DIAGNOSTIC TESTING 
Problems of Reliability 

Having in mind the characteristics of the Wechsler test, we may now
consider numerous problems of diagnostic testing which it illustrates. The
considerations to be discussed here apply equally to many alleged diagnostic
tests of mental ability, achievement, and personality.

Tests generally have great difficulty in providing diagnoses which are
adequately reliable. The total score is often reliable, but decisions are made on
the basis of subtests which are usually short. Every subtest is a test in itself
and must demonstrate adequate reliability and validity. Moreover, since the
diagnostic test is almost always intended to make important decisions about
a particular individual, high reliability of subtests must be demanded.
Unfortunately, many test manuals advise the user to interpret subscores without
giving him data upon which to judge their dependability. The Wechsler test
has had this fault, but the previously unpublished data in Table 20 indicate
that many of the subtests are adequately reliable. These data are based on a
selected group of patients; a study on a normal group is still needed. The
coefficients were obtained by correlating Form I with Form II. A few of the
tests are too inaccurate to be given weight in making decisions about a
patient. Clinicians are often impressed by the successes of Wechsler
diagnosis, so that they express little concern about reliability figures. It is far too
easy, however, to remember one’s successes and forget one’s failures. Tests
cannot be accepted on the basis of testimonials without controlled evidence.


Table 20. Coefficients of Equivalence for Subtests of the Wechsler Test, Based on
100 Cases with IQ 40 to 100

                                              Presumable        Probable
                       Correlation            Reliability       Error of
                       Between       s.d.     in Unrestricted   Weighted
Subtest                Forms(a)      (Raw)    Group(b)          Score(c)

Information              .73          2.2        .86               1.1
Vocabulary               .68          1.4         —
Comprehension            .44          2.0        .78               1.4
Similarities             .55          2.2        .70               1.6
Digit Span               .63          2.4        .78               1.4
Arithmetic               .65          2.3        .79               1.4
Block Design             .80          2.7        .85               1.2
Digit Symbol             .81          2.1        .92               0.8
Picture Arrangement      .63          2.5        .76               1.5
Picture Completion       .34          2.5        .54               2.0
Object Assembly          .56          3.5        .54               2.0

(a) Correlations of Form I and Form II, computed by Wechsler (unpublished).

(b) Correction based on sigmas given by Wechsler (17, p. 222).

(c) A score for a subject is, in 50 percent of all cases, more than one probable
error from his true score. The score is almost certainly within three probable
errors of his true score.
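The two statements in the note on probable error can be checked numerically. The sketch below assumes that errors of measurement are normally distributed, so that one probable error equals about 0.6745 standard deviations; the function name is illustrative only and is not taken from the source.

```python
import math

# Assumption: errors are normally distributed, so one probable error (PE)
# corresponds to about 0.6745 standard deviations.
PE_IN_SD_UNITS = 0.6745


def fraction_within(k_probable_errors: float) -> float:
    """Share of normally distributed errors lying within +/- k probable errors."""
    z = k_probable_errors * PE_IN_SD_UNITS
    # For a standard normal variable, P(|Z| < z) = erf(z / sqrt(2)).
    return math.erf(z / math.sqrt(2.0))


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(f"within {k} PE: {fraction_within(k):.3f}")
```

Under this assumption about half of all scores fall within one probable error of the true score, and roughly 96 percent fall within three, which is the sense in which a score is "almost certainly" within three probable errors.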


For two scores to be compared with each other, as in a profile, the difference
between them must be reliable. If two scores are highly correlated or if
either one is unreliable, differences between them are likely to be due to
chance. Bennett has produced a nomograph which reduces to simple terms
the diagnostic power of any set of tests or subtests. Nearly everyone given
two tests will be higher on one or the other. Bennett’s nomograph, Figure 29,
indicates the percentage of such differences that are “real,” i.e., likely to be
found over and over if we retested the same person. Applying the nomograph
to the difference between Comprehension and Information illustrates its use.
The correlation of these two scores, from Wechsler’s data, is .69; the reliabili-
ties are .78 and .86. This means that 13 percent of the differences we find
between the two tests are reliable, and that only the largest differences are
trustworthy. Most persons showing a difference between the two scores
might show the reverse difference on a retest. Whenever differences in a
profile are to be interpreted, the subtests should be checked by the nomo-
graph to see if they are precise enough to permit such discrimination. Few
current diagnostic tests will bear this examination. 

Fig. 29. Nomograph for evaluating reliability of tests intended for differen-
tial diagnosis (3). Find the average reliability of the two tests and locate this
value on the horizontal scale (mean reliability coefficient, .70 to .98). Find
the correlation between the tests and locate the curved line for the value
nearest to r12. Move along this line until directly above the mean reliability
coefficient. Note the corresponding value on the vertical scale, which indicates
what proportion of differences in score are in excess of chance expectation.
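A related quantity can be computed directly. The sketch below uses the classical test-theory formula for the reliability of a difference between two scores; this is not Bennett's nomograph, whose vertical scale reports a different quantity (the proportion of differences in excess of chance), and it assumes the two scores are expressed on a common scale with equal standard deviations. The function name is hypothetical.

```python
def difference_reliability(r11: float, r22: float, r12: float) -> float:
    """Classical reliability of the difference between two equally scaled scores.

    r11, r22: reliabilities of the two tests; r12: their intercorrelation.
    """
    if not r12 < 1.0:
        raise ValueError("intercorrelation must be below 1.0")
    mean_reliability = (r11 + r22) / 2.0
    return (mean_reliability - r12) / (1.0 - r12)


if __name__ == "__main__":
    # Comprehension vs. Information, using the figures quoted in the text.
    print(round(difference_reliability(0.78, 0.86, 0.69), 2))
```

The formula makes the same point as the nomograph: the numerator shrinks as the intercorrelation approaches the mean reliability, so highly correlated subtests leave little dependable difference to interpret.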

The nomograph shows that the higher the correlation between subtests,
the less likely they are to give a reliable profile. If much of what two tests
measure is the same, only a small portion of each test is left to carry the
weight of diagnosis. This has an important bearing on diagnostic test con-
struction. To yield a meaningful total score, the several parts should usually
have much in common; both Binet and Wechsler rely on this to meas-
ure general intelligence. It is not possible to eat one's cake and have it; the
correlation among subtests which makes for good measurement of the com-
mon factor interferes with diagnosis. The easiest way to get a reliable di-
agnostic picture is to use tests which have very little in common; then the
total is a poor measure of the general factor. For this reason, tests such as
Thurstone’s Tests of Primary Mental Abilities (p. 206), in which the subtests
are relatively independent, may prove to be best designed for diagnosis but
least satisfactory for a “global” estimate.

16. Case 1 in Figure 28 had a Comprehension score of 14. What can be said about 
his probable true score (cf. Table 20)? 

17. The Kuder Preference Record “diagnoses” interests for the purpose of voca-
tional guidance. Two of the scores obtained are Literary and Persuasive inter-
ests. Their intercorrelation for college students is .11 and their reliabilities are 
.93 and .90. What percentage of the subjects having differences between these 
scores really differ in the factors mentioned? 

18. The Iowa Silent Reading Tests (Advanced) “diagnose” reading abilities. Two 
of the scores are Sentence Meaning and Paragraph Comprehension. Their in- 
tercorrelation is .48 and their reliabilities are .75 and .76.* How satisfactory is 
a diagnosis based on a comparison of these subtests? 

19. Can you detect differences among the subtests of the Wechsler scale which 
would explain why some are more reliable than others? (The differences be- 
tween tests in Table 20 cannot be entirely explained from a study of Form I, 
since different questions are used in the two tests.) 

* Both reliabilities are great overestimates, since the tests are speed tests.

Validation of Diagnoses 

It is especially difficult to validate diagnostic tests because criteria must
be obtained for each type of judgment the test permits. A criterion is avail-
able for the Wechsler test because it could be applied to patients previously
diagnosed by more complete methods. The usual method for demonstrating
that a particular subtest predicts a given classification is to show that the
mean score of patients of that type differs from normals or other patients on
that subtest. Such evidence has been piled up, and tested by statistical
means, for a great assortment of signs. It cannot be doubted that the Wechs-
ler pattern is related to psychiatric diagnosis. Two cautions are important,
however. 

One is that signs must be confirmed by repeated studies before they can
be trusted; even statistically significant differences may arise owing to acci-
dent, or to failure to equate the patients in different groups on age, educa-
tion, and other factors. One strongly suspects that some of the signs discussed
in the literature of the Bellevue test are mere chance findings due to acci-
dents of sampling. Magaret and Wright, for example, found that several of
the alleged schizophrenic signs were usually present in morons (9).

The second limitation of validation by mean scores is that the relationship,
even though present, may be too low for individual predictions. Many studies
agree with Wechsler that schizophrenics have, on the average, Verbal IQ’s
above their Performance IQ’s. But when we look at Rapaport’s data (12,
Appendix II), we find that only 31 out of 72 schizophrenics have Verbal IQ’s
five points or more above the Performance IQ. Even among the fairly normal
highway patrolmen used as a comparison group, 18 out of 54 showed this
sign. Diagnosis based on multiple signs reduces errors of classification, but
few patients show all the signs of their class. Levi set up two formulas for
identifying psychopaths. But only 49 percent of psychopaths are identified
by these tests, while 9 percent of the non-psychopaths show both signs (8).
Since there are more normals than psychopaths, these signs would falsely
label many adequately adjusted people. 
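The arithmetic behind this last point can be sketched as follows. The hit rate (49 percent) and false-alarm rate (9 percent) are the figures quoted from Levi; the base rate of psychopaths in the population tested is not given in the text, so the 5 percent used here is purely an assumed value for illustration, as is the function name.

```python
def share_of_positives_correct(hit_rate: float,
                               false_alarm_rate: float,
                               base_rate: float) -> float:
    """Among persons showing the signs, the fraction who truly belong to the class."""
    true_positives = hit_rate * base_rate
    false_positives = false_alarm_rate * (1.0 - base_rate)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Sign rates in the two groups reported from Rapaport's data.
    print(round(31 / 72, 2), round(18 / 54, 2))
    # Levi's signs, with a hypothetical 5 percent base rate of psychopaths.
    print(round(share_of_positives_correct(0.49, 0.09, 0.05), 2))
```

With the assumed 5 percent base rate, fewer than one in four persons showing both signs would actually be psychopaths; the fraction rises only as the class being sought becomes more common in the group tested.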

Hunt and Stevenson comment on the falsity of attempts to set up stand-
ardized, mechanical diagnostic schemes which would eliminate “clinical
judgment” (6, p. 29): 

“Two things are wrong with this picture. In the first place all the statistical 
procedures involved in test validation are group procedures. They are based on 
a group tendency and allow for a margin of error which expresses itself as a dis- 
persion around the central tendency. Ideally, this margin is never large for many 
cases. The statistician therefore writes it off as statistically unimportant, and the 
psychologist interested in group testing writes it off as operating expense. The 
clinical psychologist, however, is not dealing with groups, but with individuals, 
and they are individuals who usually come from the deviant group, the atypical 
persons to whom a group-validated test does not do justice. In handling these
deviants, any standardized testing procedure frequently will yield erroneous re-
sults which can only be corrected by the psychologist’s clinical evaluation. In the
second place the goal of perfection has not been even approximated in many test-
ing fields. Whatever dangers are inherent in applying group techniques to an in-
dividual are accentuated by the imperfection of most of our present techniques.
The fact that psychometrics has perfection as its goal may satisfy our super-egos,
but it should not blind us to the practical fact that this goal is far away at present.” 

Attempted descriptions of mental make-up are almost impossible to vali- 
date. The interpretation is not capable of statistical evaluation by our present 
methods. Although there are certain starting points for interpretation, the 
clinician treats the same test profile differently in different cases. In one case, 
he makes allowance for a hearing deficiency; in another, for poor education 
or a foreign-language background; and in another, for the patient’s manner 
in attempting the test. Although some published discussions imply that the 
profile alone yields an adequate description of mental functioning, the in- 
terpreter actually bases his analysis on everything he can observe about the 
person and every fact from the case history which would throw light on the 
test performance. Different clinicians probably notice different things and
would make different reports about the same person. In validating a func-
tional diagnosis we must study, not the validity of the test, but the validity
of the diagnostician. 

Clinical psychology is today moving slowly away from intuitive methods, 
but diagnosis is far from an objective, scientific level (15). It is still largely 
an artistic process, in which the psychologist relies on small cues that he him-
self is not always explicitly aware of, and combines his knowledge into a
description that somehow seems to “fit.” All the research on the Wechsler
test suggests that there must be many errors in such a process. When we as-
sume, for instance, that fumbling trial-and-error on the Block Design test
implies similar behavior in problem solving outside the test, we are some-
times wrong. Undependable assumptions make diagnostic testing quite in-
adequate by the standards usually applied to validity. But we are at present
faced with the choice between relying on these tests and assumptions, and
trying to deal with people without even this partially trustworthy informa-
tion. So long as the psychologist must make a diagnosis, he is wise to use the
best indications he can get, no matter how imperfect they are. Erroneous
conclusions from diagnostic tests need not be harmful because, if a patient is
misclassified, further work with him will probably bring the error to light. 
There is no evidence to warrant relying on a functional interpretation of any 
current test as a basis for final decisions. 

Defining Normal Performance 

The profile method, which is inherent in all diagnostic testing, requires 
referring each subtest to a “normal” reference line. This reference line is 
usually the average or median of the standardization group on each test. A 
particular subject, however, must be diagnosed not as a specimen of “men in 
general” but as a member of some subclass, such as “40-year-old illiterates.”
The score on some verbal subtest that is merely normal in the standardiza- 
tion group may be exceptionally high compared to the average for illiterates. 




Upper profile: Profile of Mary Adams on Iowa Silent Reading Tests, New
Edition, Advanced, plotted in normal manner.

Lower profile: Profile of Mary Adams on Iowa Silent Reading Tests, New
Edition, Advanced, based on percentiles for college freshmen. (Top profile
reprinted by special permission from Iowa Silent Reading Tests, New Edition,
Advanced Test. Copyright by World Book Company.)

Fig. 30. Effect of norm group on shape of profile. 
If the profile were plotted against a norm line representing the average
performance of middle-aged illiterates, an entirely different impression of the
man would be gained. Wechsler guards against this error by his many 
studies of the normal performance of older subjects and other special groups, 
but other tests have made little allowance for shifts of the norm line. 

The Iowa Silent Reading Tests provide an illustration of the effect on the 
profile of choosing a different norm group. This test is designed to give 
teachers or reading specialists a quick picture of the subject’s proficiencies 
and deficiencies in separate aspects of reading. The published profile form is 
based on the performance of the average 16-year-old. When the test is used 
with college freshmen, however, it is far more important to know how a sub- 
ject compares with a group like himself. Figure 30 shows the profile of a 
college freshman, Mary Adams, plotted on the usual profile sheet. A new pro- 
file is then shown, comparing Mary to the average college freshman. The 
difference in impression given by the two profiles is substantial. 
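The effect of shifting the norm group can be illustrated with a small calculation. Every number below is hypothetical: the raw score, the means, and the standard deviations are invented for illustration and are not taken from the Iowa norms. The sketch also assumes roughly normal score distributions, whereas published tests ordinarily supply empirical percentile tables.

```python
import math


def percentile_rank(raw_score: float, norm_mean: float, norm_sd: float) -> float:
    """Percentile rank of a raw score in a norm group assumed to be normal."""
    z = (raw_score - norm_mean) / norm_sd
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    raw = 160.0  # hypothetical raw score on one subtest
    # Hypothetical general norm group versus hypothetical college-freshman norms.
    print(round(percentile_rank(raw, norm_mean=150.0, norm_sd=20.0)))
    print(round(percentile_rank(raw, norm_mean=175.0, norm_sd=15.0)))
```

The same raw score stands well above the middle of the first group and well below the middle of the second, so the profile point moves from one side of the reference line to the other without any change in the subject's performance.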

20. What are Mary’s strongest and weakest abilities, judged by the standard pro- 
file sheet? Judged by the profile based on freshmen? Which picture of Mary’s 
reading skills is more accurate? 


Norms for Diagnostic Batteries 

One problem which does not affect the Wechsler test should also be con- 
sidered here. Diagnosis is frequently based on a battery of tests, each of 
which was designed to measure some one function. If we know that a man is 



Fig. 31. Record of a 21-year-old veteran, plotted on the profile sheet
used by the Veterans Administration. 


above average in finger dexterity, below average in general verbal intelli- 
gence, and at average in mechanical comprehension, we can make valuable 
suggestions regarding his educational and vocational plans. The difficulty is
that “average” for each test is usually determined from a different group of
cases. The shape of the profile is meaningless and misleading unless the ref-
erence line represents the performance of some clear-cut group. Until supe-
rior diagnostic batteries become available, we can obtain only faulty profiles.
Figure 31 shows the profile sheet used by the Veterans’ Administration with
records of performance on five widely used tests. The norm groups for each
test are also indicated. In view of the fact that each test is standardized, not
only on different men but on different types of men, the shape of the profile
is of uncertain value. Competent counselors pay attention to the shift of norm
groups, so that they avoid being misled; less trained workers, however, and
the client himself, often fail to make due allowance for the incomparability
of the tests. To avoid this difficulty we need batteries, every test of which
is standardized on the same sample. Recent examples are the Differential 
Aptitude Tests, and the aptitude tests of the U.S. Employment Service. 

SUMMARY EVALUATION OF THE WECHSLER TEST 

Wechsler’s test conforms well to the specifications he laid down. It is a 
practical test — not difficult to administer, easy to score, and informative. 
Wechsler’s judgment, which led him to select certain subtests because he 
thought they permitted diagnostic insight, has been more than justified in 
use of the test. The opportunities for observation of behavior in both the 
verbal and performance sections are excellent. 

At the same time, one can point to numerous shortcomings. Most of these 
arise from Wechsler’s emphasis on clinical utility rather than upon any 
theory of mental measurement. The test consists of a random collection of 
items, most of them 20 or more years old. Although Wechsler collected norms 
conscientiously, and attempted some compensations for his failure to test
subjects outside the New York area, his norms are probably not representa- 
tive of Americans generally. IQ’s are determined by an arbitrary computation 
having no particular advantages or rational justification. Technical reports on 
validity and reliability have been inadequate. The test is not difficult enough
for measurement of superior subjects. Furthermore, the test was based on no
clear theory of intelligence. All these criticisms suggest that the test is far
short of the best that could be designed at present. If some worker now were
to begin with a refined conception of what abilities should be isolated, and
designed new tests for any ability not measured in the tests he happened to 
find already available, he would probably have a superior diagnostic tool. If 
such a test were standardized and validated according to procedures usually 
demanded in test research, it should be a major contribution to mental test- 
ing. 

But such a test would differ from the Wechsler primarily in refinement 
rather than in general design. Many of Wechsler’s subtests would be re-
tained, since they do reveal significant mental characteristics. The Wechsler
test at present gives a generally dependable IQ, useful in work with normal
and abnormal adolescents and adults. Because the subject reveals himself in
many subtle and detailed acts and remarks, the richest meaning of a Wechs-
ler performance is apparent only to a skilled clinician. Most of the formulas
and signs intended to permit mental diagnosis on the basis of simple combi-
nations between scores are far from convincing. 

Whether the Wechsler test is better or worse than the Stanford-Binet, for 
ages under 16, is at present a matter of personal preference. Gradually re-
search will indicate which test serves best in school practice, in work with
emotionally disturbed children, and for other purposes. Such evidence is not 
now available. 
CHAPTER 8 


Other Tests of General Ability 


It is the purpose of this chapter to acquaint the reader with the variety of 
tests available for measuring mental ability. In addition to the Binet-type 
scales and the Wechsler scale already discussed, there are numerous special- 
purpose individual tests, and group tests of equal variety. There is little merit 
in learning the characteristics of every test now on the market, so few specific 
descriptions will be given. The text will point out general features of each
group of tests, with tables summarizing features of the most widely known or
most characteristic ones. 


PERFORMANCE TESTS 
Purposes and Characteristics 

In this chapter we are concerned with performance tests as they are used 
to study general mental functioning, rather than those which measure such 
more specific abilities as dexterity and spatial ability. Performance tests are
employed widely, especially in clinical practice. They are ordinarily used in
conjunction with verbal tests, since the difference between verbal and per-
formance levels often throws light on deficiencies resulting from environ-
mental handicap, rather than lack of potential. Performance tests are es-
sential in the examination of persons with a language handicap due to foreign
background, deafness, restricted education, or other causes. 

Performance tests have already been encountered in Chapters 6 and 7, 
since they play a prominent part in the Binet and Wechsler scales. The tests
most often used in performance scales are quite similar to those employed by
Wechsler. In fact, a limited number of tasks have been used repeatedly in
different scales, with or without adaptation. Among those most frequently
encountered are the Manikin and other object assembly tests (found in the
Wechsler-Bellevue), Kohs Block Design (from which Wechsler Block De-
sign is adapted), formboard (in which the subject arranges cutout geomet-
rical forms in corresponding holes or completes a picture puzzle), and Knox
cube (in which the examiner taps on four blocks in irregular order and the
subject must repeat the tapping pattern from memory). The contribution of
investigators who assembled such tests into batteries was to standardize them 
on the same population, so that norms from test to test were comparable and 
the combined performance on all tests could be used to estimate general 
mental level. 
Table 21. Representative Performance Tests of General Ability 

Test and publisher: Arthur Point Scale of Performance Tests (Stoelting)
Age level: 4½ to superior adult
Tasks included: 10 subtests: formboard, maze, block design, etc.
Special features: Can be given in pantomime. Easy to score. More satisfactory
  at upper levels of development than Cornell scale
Comments of reviewers(a): “A good supplement to verbal tests of the Binet
  type” (Brown, 7, p. 202).

Test and publisher: Cornell-Coxe Performance Ability Scale (World Book)
Age level: 6-15
Tasks included: 6 subtests: block design, digit symbol, cube construction, etc.
Special features: —
Comments of reviewers(a): “Variety of tests not found in other performance
  scales. . . . Adequate age norms are not presented” (Whitmer, 7, p. 215).

Test and publisher: Goodenough Draw-a-Man (World Book)
Age level: 3½-13½
Tasks included: Drawing a man
Special features: Can be used in many cultures. Requires no special materials
Comments of reviewers(a): —

Test and publisher: Pintner-Paterson Scale of Performance Tests (Psych. Corp.)
Age level: 4-16
Tasks included: 15 subtests: formboard, manikin, picture completion
  formboard, etc.
Special features: Earliest elaborate battery. Widely used
Comments of reviewers(a): —

Test and publisher: Wechsler-Bellevue Performance Scale (Psych. Corp.)
Age level: 9-adult
Tasks included: Block design, picture arrangement, digit symbol, etc.
Special features: Comparable verbal scale (see Chap. 7)
Comments of reviewers(a): “A valuable clinical instrument. . . . Patterns
  indicative rather than invariably diagnostic” (35).

(a) The reader should refer to the source cited for a full statement of the
reviewer’s opinion. 


The psychologist has at his disposal an almost infinitely varied range of 
performance batteries. Some of the simplest present only one task. If more 
tests are required, as many as are desired may be added until the information 
sought is obtained. Tests vary in other characteristics also: Some are espe- 
cially interesting for children; some make small demands on coordination 
and are superior for children with physical defects; some tap simple memory
and perception, where others require systematic planning and reasoning;
and sblMir 

Advantages 

It has already been mentioned that performance tests provide essential in-
formation about children who have been handicapped on verbal tests. They
also are helpful with children who, because of school failure, are discouraged 
in verbal tasks, but who have not been conditioned against the novel and 
intriguing manual puzzles. For a group of borderline mental defectives, rat- 
ings were obtained on how well they were able to adapt to society. Binet 
scores correlated only .57 with the ratings, compared to a correlation of .77 
for the Porteus mazes (22).

The other major advantage of performance tests is the admirable oppor-
tunity they afford for clinical observation. The great variety of tasks, the high
interest they usually elicit, and the variations visible at each step of the task
make them far more helpful than verbal tests for studying some types of
cases. Procedures for observation are not standardized but, as in the Wechs-
ler test, call upon the tester to use his full alertness and insight to identify
deviations which may have clinical meaning. 

An important reason for including performance tests in studies of adults is 
that different mental functions decline at different rates after maturity is 
reached. Language tests and arithmetic tests remain stable well into the 30’s, 
before slow decline sets in. But performance on non-language tests begins to
decline in very early adulthood (26). This decline in perceptual and speed
tests, and perhaps in all tests involving spatial reasoning, makes it important
to include all types of tests in assaying the competence of a mature adult. 

Performance tests have special importance in research attempting inter-
cultural comparisons. Some performance tests have cultural loadings, but
they are free from the most obvious influences of differences in educational 
opportunity and differences in verbal culture. 

1. In what ways, if any, might cultural differences affect performance on each of
the following tests? 

a. Formboard and puzzle cutouts. 

b. Wechsler Picture Arrangement (Fig. 26). 

c. Porteus mazes (Fig. 1).

Limitations 

Performance tests have the practical limitation that they must be admin-
istered individually, or in small groups, which is time-consuming. This may
be converted into an advantage if the opportunity for observation is utilized.
Competent observation, however, requires an observer with considerable 
clinical experience, which again raises the cost of testing. 

Single performance tests are far from reliable. Each is a small sample of
ability. Temporary response sets or work habits play a major part in de-
termining score, and the habit which is rewarded in one test may lead to a
low score on one scored differently. Reliability in performance testing is
obtained by using batteries of tests, since the errors of measurement in many
tasks tend to cancel each other when combined. One set of 13 tests gives the
following reliability coefficients for 10-year-olds: .76 for boys; .54 for girls
(11).
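The way in which a battery gains reliability over a single task can be sketched with the Spearman-Brown formula. This is a swapped-in illustration rather than a computation reported in the source: it assumes that all tests in the battery are equally reliable and parallel, which performance tests certainly are not, so the results are rough orders of magnitude only. The function names are hypothetical.

```python
def battery_reliability(single_test_r: float, n_tests: int) -> float:
    """Spearman-Brown: reliability of a composite of n parallel tests."""
    return n_tests * single_test_r / (1.0 + (n_tests - 1) * single_test_r)


def implied_single_test_reliability(battery_r: float, n_tests: int) -> float:
    """Inverse of Spearman-Brown: average single-test reliability implied by a battery."""
    return battery_r / (n_tests - (n_tests - 1) * battery_r)


if __name__ == "__main__":
    # The 13-test battery reliabilities quoted in the text (boys, girls).
    for r in (0.76, 0.54):
        print(r, round(implied_single_test_reliability(r, 13), 2))
```

Read this way, a battery reliability of .76 from 13 tests corresponds to an average single-test reliability of only about .20, which is consistent with the statement that single performance tests are far from reliable.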

The “intelligence” measured by performance tests is not quite the same
as that tested by the Binet and other verbal tests. In Wechsler’s test, the cor-
relation of Verbal and Performance scores was only .83 even after correction
to compensate for errors of measurement. This correlation would be 1.00 if
the two scores represented exactly the same ability. The relation of perform-
ance scores to Binet scores is illustrated in Figure 32, which shows scores of
superior seven-year-olds on the Stanford-Binet, and a composite of the Pint-
ner-Paterson, Porteus, and Healy Picture Completion tests. The most striking
change indicated in the figure is that of Margaret. The precise cause of the 
shift is uncertain, but she is known to be greatly superior in academic skills, 
with a heavy build and a disinclination or inability to hurry, and great dif- 
ficulty in social relations with adults. 
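The "correction to compensate for errors of measurement" mentioned above is ordinarily the correction for attenuation. The sketch below shows that formula; the observed correlation and the two reliabilities in the example are hypothetical values chosen for illustration, since the text reports only the corrected figure of .83 and not the values that went into it.

```python
import math


def corrected_for_attenuation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Estimated correlation between true scores, given the observed correlation
    r_xy and the reliabilities r_xx and r_yy of the two measures."""
    return r_xy / math.sqrt(r_xx * r_yy)


if __name__ == "__main__":
    # Hypothetical inputs: observed r = .60, reliabilities .80 and .90.
    print(round(corrected_for_attenuation(0.60, 0.80, 0.90), 2))
```

A corrected value of 1.00 would mean that the two scores differ only through errors of measurement; a corrected value below 1.00 indicates that the scores reflect partly different abilities.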

The correlation of Binet IQ with performance IQ is always positive, but
the tests do not measure entirely the same things. For children around age 13,
one study found correlations of .55 (boys) and .67 (girls) between Binet IQ
and six selected performance tests. But when the correlation was computed
on a composite of fourteen run-of-the-mine performance tests, it dropped to
.41 (boys) and .49 (girls) (11). Clearly, not all performance tests are equally
good as measures of whatever the Binet test measures. Further light on the
discrepancy between Binet and performance IQ’s is obtained from results
with a small number of canal boat children in England. These children live
in an economically and educationally impoverished environment. The corre-
lations found for them were: performance test vs. Binet test, .68; performance
test vs. educational level, .26; Binet test vs. educational level, .58 (11). The
important conclusion is that a performance test is not interchangeable with a
Binet test. One need not argue which is truest; both may be accurate state-
ments about different abilities. Prediction in situations requiring verbal facil-
ity will probably be highest if a verbal test is used. In many situations, in-
cluding general adjustment to the community and to jobs, a performance
score gives relevant information. 

Rank Order,              Rank Order,
Binet IQ                 Average Performance
                         Test Score

161 Margaret             Douglas
143 Douglas              Amy
143 Amy                  Christopher
134 Carol                Virginia
132 Christopher          Dick
129 Virginia             Walter
120 Allison              Carol
110 Mark                 Allison
109 Dick                 Mark
104 Walter               Margaret

Fig. 32. Rank of 10 superior seven-year-olds on the
Binet test and on a battery of performance tests (6). 

Some writers have questioned whether performance batteries measure
general intelligence at all. Since subtest intercorrelations are low, it could be
argued that they measure different highly specific aspects of perception and
coordination. The lowness of the correlations, however, is readily explained
on the basis of the unreliability of single tasks. Gaw found sufficient positive
intercorrelation to support the view that performance tests do measure a gen-
eral mental factor. But simple timed formboards, which emphasize speed
rather than more intellectual operations, show little correlation with other
mental measures (11). An American study by Morris tends to support this
view (20). The common performance tests probably indicate general mental
ability more with younger and inferior subjects than with superior and older
individuals. 
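The degree of agreement between the two rankings in Fig. 32 can be summarized by a rank correlation. The sketch below takes the ten names and Binet IQ's as listed in the figure and assumes that the performance column is printed from highest to lowest; it computes Spearman's coefficient as the ordinary correlation between ranks, giving tied IQ's their average rank. With only ten children from a selected superior group, the result is an illustration and not a dependable estimate.

```python
import math

# Binet IQ's as listed in Fig. 32.
binet_iq = {
    "Margaret": 161, "Douglas": 143, "Amy": 143, "Carol": 134,
    "Christopher": 132, "Virginia": 129, "Allison": 120,
    "Mark": 110, "Dick": 109, "Walter": 104,
}
# Performance-test order as listed in Fig. 32 (assumed highest first).
performance_order = [
    "Douglas", "Amy", "Christopher", "Virginia", "Dick",
    "Walter", "Carol", "Allison", "Mark", "Margaret",
]


def average_ranks(scores: dict) -> dict:
    """Rank names from highest score (rank 1) downward, averaging tied ranks."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared_rank = ((i + 1) + (j + 1)) / 2.0
        for name in ordered[i:j + 1]:
            ranks[name] = shared_rank
        i = j + 1
    return ranks


def pearson(xs: list, ys: list) -> float:
    """Ordinary product-moment correlation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


binet_rank = average_ranks(binet_iq)
performance_rank = {name: float(pos) for pos, name in enumerate(performance_order, start=1)}
names = sorted(binet_iq)
rho = pearson([binet_rank[n] for n in names], [performance_rank[n] for n in names])

if __name__ == "__main__":
    print(round(rho, 2))
```

The coefficient for this small group is positive but low, and much of the disagreement comes from Margaret, who stands first on the Binet and last on the performance battery.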
Fig. 33. Healy Pictorial Completion Test. (Courtesy C. H. Stoelting Co.) 


2. Mental age scores for Dick (Fig. 32) range as follows: Stanford-Binet, 8-3; 
Pintner-Paterson, 12-6; Porteus Maze, 11-0; Healy P. C., 7-1 (6, p. 277).
What must be considered as possible causes of these variations? 

3. What justification can be offered for the common clinical practice of giving a 
verbal test as well as a performance test, even though it is known that the sub- 
ject has a language handicap? 

4. In view of the definitions of intelligence on page 107, do performance tests 
measure intelligence? 


Illustrative Results 

A sample of the information the performance test yields to a skilled tester 
and observer is indicated by the following record reported by Biber and her 
coauthors. Mark is nearly eight years old. With his Binet IQ of 110, his MA is 
8-8. On the performance tests, he reaches 9-6 (Pintner-Paterson), 9-6 (Por-
teus), and 8-8 (Healy). This is the detailed record (6, pp. 527-528):

“The most striking feature of Mark’s examination was his extreme lack of confi-
dence and his desire to do what was expected of him. This was manifested by his
constant reference to the examiner. Throughout all the tests, although he said little,
it was evident that he was referring back to see whether the expression on the ex-
aminer’s face indicated approval. 

¹ This quotation and Fig. 32 taken from Child life in school, a study of a seven-year-
old group by Barbara Biber, Lois B. Murphy, Louise B. Woodcock, and Irma S. Black,
published and copyright by E. P. Dutton & Co., Inc., New York. Copyright 1942, E. P.
Dutton & Co., Inc. 

“In the Healy Completion the examiner noticed that once when she gave him a
friendly smile he was content to leave an inferior solution, as if he were guided
much more by his wish to please than by his own good intelligence. Although she
busied herself with papers and tried to pay as little attention as is compatible with
a test situation, it was impossible to prevent this. The directions in the Healy Com-
pletion to look the work over carefully and see if there are any changes to make
seemed to imply criticism to Mark, and he removed a block which was correctly
placed and substituted a blank. His first responses were all good. In this test, he
placed the first three accurately; then, apparently, he began feeling anxious or un-
certain, and the last three he placed were blanks. It seemed that he was using the
blanks as a way of avoiding committing himself to a mistake, and that he felt that
he would rather do nothing than to get the wrong result. This test was the most
plainly motivated by his desire for approval, although there were indications of it
throughout the other tests as well. 

“In the Pintner-Paterson series, he seemed to be less conscious of the examiner,
probably because he felt more sure of himself in these tests. When he was
uncertain, as in the Ship and Triangle Tests [formboards], he would look up shyly
as he worked. Several times he commented, ‘That’s easy.’

“The first part of the maze series he enjoyed, working quickly, accurately, and
with ease. After his first failure in Year VIII [the mazes are scaled in difficulty], he
seemed much more uncertain and slow. After practically every one he said, ‘I’m
not going to do any more of these.’ With constant encouragement, he went on and
completed Years X, XI, and XII, although he had four trials on Year XII. Toward
the end of the series, there was little evidence of real effort on his part, but rather
he seemed to be going through the motions because the examiner urged him on.

“Probably no test results on Mark are completely accurate because other factors 
besides ability are so definitely involved in his behavior. Difficulty did not stimu- 
late him, as it did Douglas and Amy, for instance, but simply discouraged him and 
left him tense and uneasy. He was responsive to praise, but always with a ques- 
tioning expression, as if he were trying to ferret out what one really thought of 
him. It was consistent with his total defensive attitude that he offered very little 
information during the test and several times he responded very shortly to ques- 
tions that the examiner put to him.” 
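
The figures that open Mark's record can be checked against the ratio definition of
IQ used with the Binet scale, IQ = 100 × MA / CA, with ages written as
years-months. The sketch below is only an illustration: the function names are
invented here, and the chronological age it prints is inferred from the reported
IQ and MA, not quoted from the record.

```python
def to_months(age: str) -> int:
    """Convert a 'years-months' age such as '8-8' to months."""
    years, months = age.split("-")
    return int(years) * 12 + int(months)


def implied_ca_months(ma: str, iq: float) -> float:
    """Chronological age in months implied by a ratio IQ: CA = 100 * MA / IQ."""
    return 100 * to_months(ma) / iq


ma_binet = "8-8"   # Mark's Binet mental age, from the record
iq_binet = 110     # Mark's Binet IQ, from the record

ca = implied_ca_months(ma_binet, iq_binet)
print(f"Implied CA: {ca:.1f} months ({int(ca // 12)} yr {ca % 12:.1f} mo)")

# Performance-test age scores from the record, compared with the Binet MA.
performance = {"Pintner-Paterson": "9-6", "Porteus": "9-6", "Healy": "8-8"}
for test, age in performance.items():
    diff = to_months(age) - to_months(ma_binet)
    print(f"{test}: {diff:+d} months relative to Binet MA")
```

The implied age of about 7 years 10½ months agrees with the description of Mark
as "nearly eight years old," and the comparison shows two of the three
performance scores standing ten months above his Binet mental age.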

5. What light do these observations throw on the interpretation of Mark’s Binet 
IQ? 

6. List several of the characteristics a tester should attempt to observe in giving 
an individual performance test. 

7. In many clinical examinations, only part of a battery of performance tests is
used, to save time. On what basis would you decide which subtests to retain?

TESTS OF EARLY DEVELOPMENT 
Description 

Much of the research on intelligence and personality has been done on
children below the age of six. For such studies, as well as for clinical use, tests
have been needed to measure the present development of a child and to predict
his later standing. Numerous tests have been developed for comparing
various aspects of child development. The aim in these tests has been to
determine how far the child has progressed in the sorts of development normally
taking place at his age. Little effort was made, in the beginning, to restrict
infant and preschool scales to measures of intelligence. Workers have
borrowed heavily from each other, and all scales have considerable resemblance.
The discussion below centers on Bayley’s definitive study, which
dealt with most of the questions of importance in this area.

We cannot test a one-year-old on abstract thinking; we don’t even know 
if he is doing it at this age. We cannot test the three-year-old on complex 
reasoning, because he has not developed adequate sustained attention and 
understanding of directions to attempt the problem. Investigators have con- 
centrated on those aspects of behavior which can be identified objectively 
in the young child. Bayley’s study (4) used a composite of 185 items from 
existing scales, of which the following are representative. The number in
parentheses is the scale placement, that is, the age in months at which the
development is normally found.

(0.6) Lateral head movements, prone.

(1.4) Vertical eye coordination. 

(3.0) Reaches for ring. 

(3.6) Manipulates table edge. 

(5.5) Discriminates strangers. 

(5.9) Vocalizes pleasure. 

(6.6) Lifts cup by handle. 

(8.6) Says da-da or equivalent. 

(9.3) Fine prehension. 

(9.9) Rings bell purposefully. 

(13.5) Makes tower of two cubes. 

(16.6) Turns pages. 

(20.1) Square of triangle in Gesell formboard, reversed.

(21.5) Names three objects. 

(25.0) Understands two prepositions. 

(28.4) Picture completion, one correct. 

(34.6) Copies circles, three trials. 

(35.6) Remembers one of four pictures. 

8. Characterize the sorts of items used within each year level (assuming that
those listed are characteristic).

9. How do the abilities tested in this scale differ from those tested in the
Stanford-Binet?

10. To what extent would differences in experience give some children an
advantage on the tasks listed?

11. How well do these items fit Binet’s definition of intelligence?

Because the most observable developments in young children are in motor
facility, scales have included many motor items. Most writers make
comments such as Boynton’s: “When the Linfert-Hierholzer Scale attempts
to measure intelligence in terms of the child’s ability to follow visually a ball
or to use a spoon in eating, or when Charlotte Buhler looks for intelligence
in a child’s smile or in the fact that he seeks a lost toy, it is apparent that the
procedure involves matters which neither the layman nor the psychologist
would regard as integral aspects of intelligence at a later age” (19, p. 629).
The scales do, however, permit one to identify the child who is making less
progress than others, in the sorts of development usually studied at these
ages.


Table 22. Representative Tests of Infant Development

California First Year Mental Scale (Stoelting)
   Age level: 1 month to 18 months
   Types of performance: Coordinations, vocalization, simple formboards, etc.
   Special features: Standardized on same children at all ages
   Comments of reviewers*: “Fairly accurate appraisal of child's status at time of
      testing. . . . Not a basis for predicting mental status after infancy”
      (Goodenough, 7, p. 205).

California Preschool Mental Scale (Univ. of Calif. Press)
   Age level: 1½ to 6 years
   Types of performance: Manual facility, block building, language comprehension,
      recall, etc.
   Special features: Yields MA, IQ, and profile of developments
   Comments of reviewers*: “Good tests, inadequately standardized. . . . MA-IQ type
      of scoring seriously misleading” (Castner, 7, pp. 206-207).

Cattell Intelligence Scale for Infants and Young Children (Psych. Corp.)
   Age level: 2 months to 4½ years
   Types of performance: Attention to test objects, manipulating cubes, pegboard, etc.
   Special features: An extension of the Stanford-Binet, with some duplication
   Comments of reviewers*: “Items very well adapted to infants. . . . Doubtful
      validity below 12 months” (Wellman, 8).

Linfert-Hierholzer (Williams and Wilkins)
   Age level: 1 month to 12 months
   Types of performance: Coordination, reflexes, attention
   Special features: Relation to later intelligence negligible
   Comments of reviewers*: “Value now primarily historical” (Bayley, 8).

Merrill-Palmer Scale (Stoelting)
   Age level: 18 months to 71 months
   Types of performance: Pegboards, information, block building, picture puzzles, etc.
   Comments of reviewers*: “Predictive value very limited. . . . Outstanding merit
      is children's interest. . . . [Many] items test motor skills rather than
      insightful behaviors” (Bayley, 7, p. 228).

Minnesota Preschool Scale (Educ. Test Bureau)
   Age level: 1½ to 6 years
   Types of performance: Vocabulary, drawing, response to pictures, digit span, etc.
   Special features: Equivalent forms available
   Comments of reviewers*: “Not very satisfactory with many children under
      three. . . . Most suited for children from 3 to 5 years of middle ability
      ranges” (Stutsman, 7, pp. 231-232).

* The reader should refer to the source cited for a full statement of the reviewer's opinion.


Limitations 

Bayley’s statistical findings are worth considerable attention, not only because
of their direct value but because they represent a type of careful
analysis of tests that should be made in other areas. The first finding to consider
relates to reliability. One group of children was tested every month on
her scale, and the split-half reliability coefficients shown in Figure 34 were
obtained.

Fig. 34. Reliability of Bayley's infant test at various ages (4).

Bayley considers that the low reliabilities at ages 1-3 months and 12-15
months require special explanation. One reason is that more items enter
scores at some ages than at others. Tests dependent on maturation she considers
more reliable than those based on experience, which would explain
higher reliabilities where tests of maturation predominate. A more widely
applicable point is that at a level where a particular type of behavior is just
emerging, the pattern is diffuse, varied, and inconsistent from time to time;
measurement of such functions is therefore unstable. This accounts not only
for the unreliability of neonate tests but for unreliability at the ages around
12 months, when the child must first make adaptations to an adult examiner.
Bayley’s data illustrate that a test having satisfactory over-all reliability may
be unreliable at certain levels or for certain groups.
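
A split-half coefficient of the kind reported here is conventionally obtained by
scoring two halves of the test separately, correlating the half scores across
children, and stepping the correlation up to full test length with the
Spearman-Brown formula. The sketch below assumes an odd-even split and pass/fail
item scoring; both are assumptions made for illustration, and the records are
invented, not Bayley's data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def split_half_reliability(item_scores):
    """Odd-even split-half reliability, stepped up by Spearman-Brown.

    item_scores: one list of 0/1 item scores per child, items in test order.
    """
    odd = [sum(row[0::2]) for row in item_scores]    # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]   # items 2, 4, 6, ...
    r_half = pearson(odd, even)
    return 2 * r_half / (1 + r_half)


# Hypothetical pass/fail records for six children on eight items.
records = [
    [1, 1, 1, 1, 1, 1, 0, 1],
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 1, 0, 0],
]
print(round(split_half_reliability(records), 2))
```

Computing the coefficient separately for the children tested at each age, as in
Figure 34, is what exposes the weak spots that a single over-all coefficient
would conceal.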

12. How do Bayley’s coefficients compare with the reliability of the Binet test for
single-year groups (cf. p. 126)?

13. What implications does the unreliability of emerging behaviors have for
testing each of the following?

a. Social behavior of preschool children. 

b. Vocational interests of adolescents. 

c. Political attitudes. 

d. Skill of manual workers early in the training period.

e. Relative popularity with boys, of girls in early adolescence. 

The trend of mean scores with age rises in a fairly smooth curve. Figure 35,
representing the standard deviations of scores at various ages, is more astonishing,
especially as the same children were tested at different ages. The test
apparently does not reveal individual differences equally at all ages. Bayley,
in casting about for an explanation, studied the performance on different
types of items. Items were classified in two subtests, one representing sensorimotor
behavior (including simple vocalization) and one representing adaptive
behavior (problem solving, imitation). The results plotted in Figure 36
were obtained. Babies differ in sensorimotor behavior most at age six months,
after which they become more alike. The rapidly developing infants have
accomplished all they can in this type of development, and the slow ones catch
up. As measured by this test, individual differences in adaptive behavior
increase steadily with age, except near 12 months where the adaptive items
contain a motor element. Essentially, the test seems to sample two different
abilities, one of them being sensorimotor growth rather than intelligence. At
different ages, the test measures different things.

Fig. 35. Standard deviations of infant test scores at various ages (4).

Fig. 36. Standard deviations of scores on sensorimotor and adaptive items (4).
Horizontal axis: Age (Months), 0 to 36.

There is no evidence that early development along non-mental lines is a
good predictor of later mental prowess. Infant tests have been conspicuously
non-predictive. Bayley makes the statement that mental test score at three
years is better predicted by the education of the parents than by anything
the child himself does during his first year (4, p. 73). Since we rarely care
about preschool intelligence except on the assumption that it predicts later
intelligence, this is a most serious limitation. Fortunately, adaptive and
intellectual functions become easier to measure by age two to four, so that
mental tests for the later preschool years have been somewhat successful.



Fig. 37. Correlation of preschool tests with Stanford-Binet 
intelligence at age seven (12). 


Honzik’s data on the predictive value of infant tests show this rise in
correlation (Figure 37).

Anderson states four precautions required with infant tests (3, p. 376): 

“1. The earlier . . . the measurements are made, the less reliance can be placed 
on a single measurement or observation, if that measurement or observation is 
used for predicting subsequent development. 

“2. The earlier . . . the measurements are made, the greater care should be
taken to secure accuracy of observation and record and to follow standardized
procedures.

“3. The earlier . . . the measurements are made, the more account should be
taken of the possibility of disturbing factors, such as negativism and refusals, that
operate as constant errors to reduce score.

“4. Since development is a timed series of relations or sequences, there are for
many functions periods below which only a small portion of the function can be
measured and above which a progressively larger portion can be measured. Hence,
the possibilities of prediction are limited and progression with age is not an
infallible indicator of the value of a measurement. Every effort should be expended to
secure the most accurate and predictive tests by standardizing tests against multiple
criteria [particularly measures of ability in later life] rather than against single
criteria.”

OTHER INDIVIDUAL TESTS 

Individual mental tests for special clinical purposes have departed in several
ways from the common tests so far described. Of these miscellaneous
tests, only the Kent emergency tests are sufficiently prominent to require
attention here. In contrast to the usual tests which require a large part of an
hour, Kent designed her materials to fit into the clinical “emergency” where
a rapid judgment must be made and a quick test would reduce error in judgment.
Her set of five oral tests and seven written tests, each requiring about
one minute, provides great flexibility and reasonable accuracy (16). The
widely used EGY test was illustrated on page 22. The tests were helpful in
many military situations, where large numbers of men were given quick
psychiatric checks. Kent, in designing the battery, intended that the clinician
choose whichever tests would be most helpful for each case; this contrasts to
the usual psychometric plan of testing each person on the same items.

Tests used for the battery were generally designed to maximize opportunity
to observe personal characteristics. In verbal tests, Kent recommends
noting the subject’s attitude to the examination, his spontaneity, his friendliness
or hostility, his response to the examiner’s personal approach, suggestibility,
interest or indifference to the whole test and to particular materials,
willingness or reluctance to hazard a conjecture, sensitiveness when forced
to acknowledge that he cannot answer a question, suspiciousness, and change
of attitude at any time in the test.

Another departure in individual testing is the “concept formation” test, a
technique designed primarily to reveal the way in which a clinical patient
thinks. Since this technique is more important as an observation than for
measuring general ability, it is discussed in Chapter 19.

GROUP TESTS OF VERBAL INTELLIGENCE 
Problems of Group Testing 

Group tests are far more commonly used than individual tests because of
their economy and practicality. Particularly in dealing with masses of subjects,
whether in the Army, industry, schools, or research, group tests are
essential. Furthermore, under favorable conditions they are as reliable and
have as high predictive validity as do comparable individual tests. But they
have certain limitations.

Group tests almost invariably make demands upon language. The questions
are usually verbal, and even when they are not, the directions are
verbal, with few exceptions. In addition to the bilinguals, ethnic groups, and
the blind and deaf who have trouble on verbal individual tests, the group
test handicaps those with partial loss of sight or hearing, and those with reading
difficulties. So much is this the case that a low score on the typical group
intelligence test at the high-school level is as likely to represent ineffective
reading as it is to represent low general ability.

Group tests are, with few exceptions, speed tests. This makes for uniform
administration and permits good prediction in most situations. A few students,
and many adults, tend to earn lower relative scores on speeded tests
than they would if the same questions were given without time limit. Most
tests combine power and speed, presenting items in order of difficulty so that
each student encounters the items he can do. Table 23 indicates the effect on
score when pupils are given added time on three typical tests. Evidently most
pupils finished in the standard time all the items they could do. For occasional
cases, of course, speed will still be the principal factor determining
score. The current trend in making new tests is to provide ample time for
nearly everyone to finish, or, if the test is speeded, not to chop the test into
numerous subtests with time limits of five minutes or less.


Table 23. Effect of Giving Pupils Additional Time on Group Intelligence Tests (10)

Age of   Number of                           Standard   Extra     Mean Points Earned   Mean Points Earned
Pupils   Pupils      Test                    Time       Time      in Standard Time     in Additional Time

9-10     223         Otis Alpha Non-Verbal   20 min.    30 min.   65.0                 1.1
9-10     226         Henmon-Nelson           30 min.    20 min.   54.1                 3.4
13-14    235         Otis Beta (verbal)      30 min.    15 min.   60.4                 0.9
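
To judge how small the gains in Table 23 are, each can be expressed as a
percentage of the score earned in standard time. The sketch below only
re-expresses the means given in the table; the names used are illustrative.

```python
# (test, mean points in standard time, mean points added in extra time), from Table 23
table_23 = [
    ("Otis Alpha Non-Verbal", 65.0, 1.1),
    ("Henmon-Nelson", 54.1, 3.4),
    ("Otis Beta (verbal)", 60.4, 0.9),
]


def percent_gain(standard: float, added: float) -> float:
    """Points earned in the extra time as a percentage of the standard-time score."""
    return 100 * added / standard


for test, standard, added in table_23:
    print(f"{test}: +{added} points, {percent_gain(standard, added):.1f}% of standard-time score")
```

Even the largest gain, on the Henmon-Nelson, is only about 6 percent of the mean
score earned within the standard limit.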


14. In the Air Force Qualifying Examination most applicants finished in 90 minutes,
but a few kept working for 5 hours. What evidence would justify the
decision to set a time limit of 3 hours on the test?

15. What arguments can you advance for and against the use of short, separately
timed subtests in a 40-minute mental test?

Group tests rely on questions where the answer is a single response or a
choice among alternatives. Such responses are objectively scorable, but in
many ways less informative than the more varied responses obtainable on
the usual individual test. The group test is somewhat limited in the sorts of
abilities tapped; ability to invent solutions is notoriously unmeasured in the
usual reasoning items. Tests often contain a few unsound items, in which
a bright student can find a way to justify an answer other than the test
maker’s. Multiple-choice tests are more affected by chance than questions in
individual tests, and response sets are found in some of the items used.

For some subjects, group administration does not yield the most favorable 
results. The skilled examiner, establishing rapport in an individual test, will 
overcome the inhibitions that prevent a timid subject from giving all the 
responses he can, but in the group test such rapport is hard to establish. In- 
quiry about ambiguous answers on an individual test may raise the score of a 
subject whose offhand answers are inadequate, but he gets no chance to explain
himself in the group test. The group test is based on the assumption
that every person understands the nature and purpose of the testing and
wants to do his best; whenever these ideal conditions are not met, results are
invalid. In individual testing, the examiner can at worst usually recognize
when the score does not do justice to the subject’s ability.
Table 24. Representative Group Verbal Tests

ACE Psychological Examination (Educ. Testing Service)
   Grade level: High school, college
   Special features: L (linguistic) and Q (quantitative) scores. Primarily for
      college guidance (see Table 28)
   Comments of reviewers*: “A well-planned hour test” (R. L. Thorndike, 7, p. 201).
      “No evidence as to the validity of Q and L scores” (Dunlap, 7, p. 200).

Army Alpha: First Nebraska Edition (Sheridan)
   Grade level: High school, adults
   Special features: Early forms once used widely in research. Revision gives
      verbal, number, and relationships scores

Army General Classification Test (U. S. Army; Civilian form, Sci. Res. Assoc.)
   Grade level: High school, adults
   Special features: Used in World War II. Norms for vast number of adults

California Test of Mental Maturity (Calif. Test Bureau)
   Grade level: Kgn.-1; 1-3; 4-8; 7-10; 9-adult. Various short forms
   Special features: Yields diagnostic profile showing MA in separate tasks,
      “language” and “non-language” IQ's
   Comments of reviewers*: “The unabbreviated batteries are among the very
      best. . . . Language enters both IQs” (Kuhlmann, 7, p. 209). “Doubtful whether
      two tests of memory are sufficient and whether there are enough subtests in
      other groupings for differential judgment” (Garrett, 8).

Henmon-Nelson (Houghton Mifflin)
   Grade level: Grades 3-8; 7-12; college
   Special features: Easily administered 30-minute test
   Comments of reviewers*: “Serves [for] preliminary rapid exploration and rough
      classification” (Anastasi, 7, p. 221).

Kuhlmann-Anderson (Educ. Test Bureau)
   Grade level: Overlapping forms, grades 1-12
   Special features: Variety of test materials, scaled over continuous range.
      Relatively hard to administer
   Comments of reviewers*: “Relatively less dependent upon reading skill than most
      other group tests” (Marzolf, 8).

Ohio State University Psychological (Sci. Res. Assoc.)
   Grade level: Grades 9-college
   Special features: Unspeeded test of verbal abilities
   Comments of reviewers*: “[Yields] best predictions now available for an over-all
      academic aptitude instrument at the college level” (Guilford, 8).

Otis Quick-Scoring (World Book)
   Grade level: Grades 1-4; 4-9; 9-college†
   Special features: Easily administered 30-minute test
   Comments of reviewers*: “Otis tests among the easiest and most economical to
      score. . . . Hazardous to attempt prediction in individual case [in grades
      1-4]” (Kuder, 8).

Terman-McNemar (World Book)
   Grade level: Grades 7-12
   Special features: Measures verbal abilities
   Comments of reviewers*: “Restriction to verbal materials has had the effect of
      narrowing and purifying the function measured” (R. L. Thorndike, 8).

* The reader should refer to the source cited for a full statement of the reviewer's opinion.

† These forms are known as Alpha, Beta, and Gamma. Numerous other tests by Otis are
similar. Wonderlic has prepared a revision for industrial use.
“Group” tests are frequently administered individually. In educational,
vocational, and adjustment counseling a group test will often provide the
desired data. Since group tests can be given by less skilled testers than
individual tests require, they are advantageous.

Characteristics of Group Tests 

Table 24 lists the features of a number of group tests. Because of their wide
sale to schools, it has been profitable to design group mental tests with
greater care than is true for most psychological tools. The better tests have
been carefully standardized, and some provide comparable forms for use at
all school grades from primary to high school. Items have usually been carefully
selected, and directions and manuals worked out so that tests can be
used without psychologically trained personnel.

The common tests are heavily verbal. While several of them include problems
requiring reasoning about numbers or geometric forms, verbal items
account for the bulk of the variation in scores. This is warranted by the fact
that verbal items generally correlate well with other sorts of items. Nearly all
group tests are thought of by their designers as scholastic aptitude tests. Even
the Army, in Army Alpha and the Army General Classification Test, sought to
predict which men could most readily learn new duties. Since school demands
verbal facility at every turn, this justifies the usual verbal loading of
tests.

The majority of current tests are adequately reliable. In one tabulation of
coefficients of equivalence from common tests, nearly all have coefficients
between .88 and .94 (19, pp. 631-632). Authors of group tests have often
violated the rule that split-half methods must not be used with speeded tests
(p. 69). As a result, the reliabilities claimed for some tests are illegitimate.
Only tests which nearly all pupils finish can be studied by the split-half
method.
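
The rule against split-half coefficients for speeded tests can be illustrated with
an extreme case. Suppose, purely as an assumption for illustration, that every item
is easy and a pupil's score is simply the number of items he reaches before time
is called. His odd and even half scores can then differ by at most one point, so
the two halves correlate almost perfectly no matter how consistent or inconsistent
his working speed would be from one testing to another. The figures below are
invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def half_scores(items_reached: int):
    """Odd and even half scores when every item reached is answered correctly."""
    odd = (items_reached + 1) // 2   # items 1, 3, 5, ... among those reached
    even = items_reached // 2        # items 2, 4, 6, ... among those reached
    return odd, even


# Hypothetical numbers of items reached by ten pupils on a 60-item speeded test.
reached = [22, 31, 27, 40, 35, 18, 29, 44, 25, 38]
odd, even = zip(*(half_scores(k) for k in reached))

r_half = pearson(odd, even)
r_full = 2 * r_half / (1 + r_half)   # Spearman-Brown step-up
print(f"half-test r = {r_half:.3f}, stepped-up coefficient = {r_full:.3f}")
```

The coefficient comes out close to 1.00 by construction, which is why it says
nothing about the stability of the scores themselves.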

One caution must be repeated: Different tests of the same function are not
interchangeable. Not only have authors used different standardizing groups,
but they converted scores to “IQ’s” which resemble Binet IQ’s only in having
a mean of 100. If persons with Binet IQ’s of 125 took nine different group
tests published before 1924, they would earn the following average IQ’s on
the different tests: 124, 124, 127, 128, 131, 138, 138, 143, 149 (18). While
more recent tests have been based on larger norm groups, the IQ still rises
or falls according to what test is given. In one recent study, over 2200 nine-
and ten-year-olds took four accepted group tests. The number of pupils
receiving IQ’s below 80 was as follows: on the Henmon-Nelson test, 119;
Kuhlmann-Anderson, 53; Otis Alpha Verbal, 47; Otis Alpha Non-Verbal, 68. The
differences in the proportion of high IQ’s were even more marked, but partly
explained by the fact that the two Otis tests were too easy to test the full
ability of the better pupils (10). How one person’s tested IQ varies is
illustrated by the scores earned by an adolescent boy during a three-year
period (Table 25).
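
The size of these discrepancies is easier to judge when reduced to a few summary
figures. The sketch below uses only the numbers quoted in the paragraph above.
Because the study is described as covering "over 2200" pupils, the total of 2200
is an assumed round figure and the percentages are approximate.

```python
from statistics import mean

# Average IQs earned on nine group tests by persons of the same Binet IQ (18).
group_iqs = [124, 124, 127, 128, 131, 138, 138, 143, 149]
spread = max(group_iqs) - min(group_iqs)
print(f"lowest {min(group_iqs)}, highest {max(group_iqs)}, "
      f"spread {spread} points, mean {mean(group_iqs):.1f}")

# Pupils receiving IQs below 80 on four tests given to the same group (10).
below_80 = {
    "Henmon-Nelson": 119,
    "Kuhlmann-Anderson": 53,
    "Otis Alpha Verbal": 47,
    "Otis Alpha Non-Verbal": 68,
}
N_PUPILS = 2200  # assumed round figure; the text says "over 2200"
for test, count in below_80.items():
    print(f"{test}: {count} pupils, about {100 * count / N_PUPILS:.1f}%")
```

The same group of persons spans 25 IQ points across the nine older tests, and the
Henmon-Nelson places more than twice as many pupils below 80 as either the
Kuhlmann-Anderson or the Otis Alpha Verbal.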


Table 25. Intelligence Test Results for John Sanders*
(14, p. 90)

Test                      John's CA    IQ

Kuhlmann-Anderson         12.1         101
Terman Group Test A       12.6         119
Terman Group Test B       12.7         100
Stanford-Binet (1916)     12.8         109
Kuhlmann-Anderson         14.1          91
Terman Group Test A       14.5         113
Terman Group Test B       14.6         112
Terman Group Test B       15.1         115


Group verbal tests do not differ very much in over-all character. The differences
that lead the tester to choose one test rather than another relate to
the convenience of a test in a particular situation. Carefully prepared group
tests have comparable reliabilities and validities, and tests for a given age
test about the same functions. The lengths of the tests vary from 15-minute
tests used in rough classification in industry to the 2-hour tests used to measure
college aptitude. In general, longer tests are required for fairly homogeneous
groups or for refined measurement. Another major difference between
group tests is in convenience of administration and scoring. The quick-scoring
features of the recent tests are greatly appreciated where large numbers
of papers must be processed in a short time. A third feature to consider in
choosing a test is the suitability of the norms for the use the tester has in
mind. This is not highly important, however, as local norms based on the
group with which the subject will be working are generally more informative
in the types of processing where group tests are used.

* See also p. 132.

56. [Number series, not legible in this copy.] What number should come next?
    (1) 7, (2) 10, (3) 8, (4) 6, (5) 11

57. Which word does not belong with the others? 1 apparatus,
    2 foundation, 3 equipment, 4 device, 5 appliance.

58. Inconsequential means about the same as: 1 sorry, 2 incorrect,
    3 useful, 4 unimportant, 5 necessary

59. [Figure analogy with five pictured alternatives; drawings not reproduced.]

60. On an addition test a boy got 12 problems right, giving him
    an accuracy of 75%. How many problems did he miss?
    (1) 8, (2) 9, (3) 6, (4) 4, (5) 3

61. The United States entered the World War in: (1) 1914, (2) 1915,
    (3) 1916, (4) 1917, (5) 1918

Fig. 38. Items from Henmon-Nelson Test of Mental Ability,
High School Examination.
From the Henmon-Nelson Tests of Mental Ability, Houghton Mifflin Company.

California Test of Mental Maturity, Advanced Series (1946 Revision, Grades 9-Adult)
Devised by Elizabeth T. Sullivan, Willis W. Clark, and Ernest W. Tiegs

[Profile sheet for one pupil; the plotted profile is not reproduced. The sheet
lists the test factors below, with possible scores in parentheses, and plots each
score against a mental-age scale that begins at 10 years (120 months).]

   1. Visual Acuity; 2. Auditory Acuity; 3. Motor Co-ordination
   A. Memory (53): 4. Immediate Recall* (33); 5. Delayed Recall (20)
   B. Spatial Relationships (45): 6. Sensing Right and Left* (20);
      7. Manipulation of Areas* (15); 8. Foresight in Spatial Situations* (10)
   C. Logical Reasoning (60): 9. Opposites* (15); 10. Similarities* (15);
      11. Analogies* (15); 15. Inference (15)
   D. Numerical Reasoning (45): 12. Number Series* (15);
      13. Numerical Quantity* (15); 14. Numerical Quantity (15)
   E. 16. Vocabulary (50)
   Total Mental Factors (A + B + C + D + E) (253)
   F. Language Factors (5 + 14 + 15 + 16) (100)
   G. Non-Language Factors (Total Mental Factors minus F) (153)
   * Non-language tests.

[The summary of data on the sheet gives M.A. ÷ C.A. = I.Q. for Total Mental
Factors, Language Factors, and Non-Language Factors. Legible entries include a
C.A. of 182 months, a Language M.A. of 202 months, a total I.Q. of 106, and a
Non-Language I.Q. of 101.]

Fig. 39. Profile sheet for California Test of Mental Maturity.

Few group intelligence tests make any attempt to yield diagnostic scores. 
One major exception, the Primary Abilities tests, is treated in Chapter 9. Most 
group tests yield only a single score, recognizing that reliable measures can 
rarely be obtained in subtests requiring less than 15 minutes. A few tests 

TEST 11. 

Directions: In each row, the first picture is related to the second. Find a picture that goes with 
the third picture in the same way. Put an X under it and write its number on the line to 
the right. 



Fig. 40. Items from California Test of Mental Maturity (Elementary),
Analogies subtest.*


provide two scores, one emphasizing verbal factors and one emphasizing
spatial, numerical, and non-verbal reasoning abilities. Sometimes the discrep-
ancy between these two scores will call attention to a subject having a read-
ing handicap or a language deficiency.

The California Test of Mental Maturity (CMM) series is particularly
interesting in view of its attempt to make an intensive diagnosis of mental
ability. It contains a variety of objective subtests which yield an IQ of satis-
factory reliability. In addition, the CMM yields separate “Language” and
“Non-Language” IQ’s. The reliabilities of the long form for the Advanced
test are .945 for the Language score, and .985 for the Non-Language score.
The two scores correlate .63, which shows that somewhat different abilities

* Fig. 39 from the California Test of Mental Maturity, Advanced Series; Fig. 40 from
ibid., Elementary Series. Both are published with permission of the California Test Bu-
reau, Los Angeles, California.

them. This diagnosis is further broken into a profile showing
ability in different areas, as illustrated in Figure 39. Care is needed in in-
terpreting the profile to avoid giving undue weight to fluctuations due to
chance. While the five factor scores have adequate reliability and measure
fairly distinct abilities, attempts to interpret the profile on single subtests
(such as Manipulation of Areas) are most inadvisable. These subtests are
short and therefore not reliable; strength or weakness on any single test is
not significant unless it is confirmed by further observation. Since nearly all
the tests involve more than one type of ability, users must study the content
of the sections of the tests rather than interpret weakness in “memory” or
“logical reasoning” at face value.
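
One way to judge how large a chance fluctuation may be is the standard error
of measurement, SD × sqrt(1 − r), where r is the reliability coefficient. The
sketch below applies that standard formula to the two reliabilities quoted
above for the Advanced long form (.945 and .985). The standard deviation of 16
points is an assumed figure chosen only for illustration, and the reliability
of .60 is a hypothetical stand-in for a short subtest; neither value comes from
the text.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    if sd < 0:
        raise ValueError("standard deviation cannot be negative")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


ASSUMED_SD = 16.0  # hypothetical standard deviation; not taken from the text

scores = [
    ("Language IQ", 0.945),
    ("Non-Language IQ", 0.985),
    ("short subtest (hypothetical)", 0.60),
]
for label, r in scores:
    sem = standard_error_of_measurement(ASSUMED_SD, r)
    print(f"{label}: reliability {r:.3f}, SEM about {sem:.1f} points")
```

The lower the reliability, the wider the band within which a score may shift
on retest without any real change in the subject.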

16. What evidence would help you decide on the “true” number of cases having 
IQ below 80 in the group of 2200 discussed above? 

17. What is the best estimate you can make as to John Sanders’ “true” IQ? 

18. In common usage, is “scholastic aptitude” identical to “intelligence”? 

19. One section of the Ohio State Psychological Examination is a test of reading 
comprehension. Can you justify including this section in a test of scholastic 
aptitude? 

20. There are three major quick-scoring techniques. The Henmon-Nelson test uses
a booklet sealed at the edges. After the subject has marked his test, he tears
open the booklet and finds that each answer has been transferred to a sheet
where printed squares indicate the correct answer, so that the score may be
counted directly. The Ohio State test uses a pinprick method of transferring
marks from the front sheet to a concealed, pre-keyed second sheet. Holes
punched through marked spaces on this inner sheet represent correct answers.
The IBM answer sheet, described on page 46, is the third method. Papers in
this form may be hand-scored with a punched-out key. Consider the relative
advantages of these three devices with respect to the following characteristics:

a. Ease of changing an answer after marking it.

b. Possible inaccuracy in obtaining scores. 

c. Speed and convenience when the subject scores his own paper. 

21. There has been a trend recently toward short tests. For example, a 15-minute
test (SRA Verbal Form) has been derived from the ACE test, which requires
about an hour. What limitations does such a shortened test have, compared to
the long form?

22. Describe the abilities required by the Henmon-Nelson test (Fig. 38). 

23. Suppose that some subtest scores in the profile, Figure 39, change two points
up or down on retest. This is not unlikely on multiple-choice items. How great
a change might this make in the profile of “factor” scores? Of scores on single
subtests?

24. The subtest reproduced in Figure 40 is titled “Reasoning-Analogies.” What 
other mental functions seem to be involved in this test, and would in part ac- 
count for differences in score? 

25. The Analogies subtest is included in the CMM score Non-Language Factors.
Why have some writers criticized that score as not being free from the influ-
ence of language abilities? Does this influence impair the usefulness of the
non-language score?

26. What interpretation would be made if a child has a Non-Language IQ of 120,
Language of 90? What interpretation would be made if the Language IQ were
120, Non-Language 90?

27. In the CMM Elementary test, pupils listen to a story about “The Pack Train.”
In the story, a man goes to a mining camp by pack train, passing a glacier and
being threatened by a grizzly bear. After hearing the story, pupils go on to
take other sections of the test. After an elapsed time of 25 minutes, the pupils
are asked questions about the story. What does the test measure besides gen-
eral ability in “delayed recall”?

28. What interpretations about Harold Brown are suggested by his CMM profile,
when you bear in mind the nature and reliability of the subtests?

29. One of the subtests of the Tanaka group intelligence test (Japanese) requires
the subject to cross slanting lines, making X’s as rapidly as he can, thus
XXXXX//////. Such a subtest is rarely used in American intelli-
gence tests. On what basis could the inclusion of such a test be criticized?
What argument or evidence would justify including this subtest in a general
mental test?

Empirical Validity 

Because group tests were first thought of as substitutes for the Stanford-
Binet test, their correlations with that test are an important evidence of
validity. If the correlation is high, they measure the same thing as the Binet.
Representative correlations with the Binet test actually found are as follows:
CAVD, .87; Detroit Advanced First Grade, .85; and ACE, .58-.62 for college
freshmen (19, 2). These correlations are based on typical school populations.
Comparable evidence for clinical groups and for industrial workers is lack-
ing. In clinical work, individual tests are preferred over group tests. Although
individual tests are less convenient, they provide more information. In in-
dustrial work, they are rarely considered, and validity is judged by the value
of the group test in predicting a job criterion.
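
The coefficients quoted above are presumably product-moment correlations
between scores on two tests taken by the same persons. As a rough illustration
of how such a coefficient is computed, the sketch below implements the Pearson
formula directly. The paired scores are invented for the example and are not
data from any study cited here; the function name `pearson_r` is likewise only
illustrative.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Product-moment correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("scores must vary in both lists")
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical IQs for six pupils on a group test and on an individual test.
group_iq = [92, 101, 108, 115, 97, 124]
individual_iq = [95, 99, 112, 110, 102, 120]
print(f"r = {pearson_r(group_iq, individual_iq):.2f}")
```

A coefficient near 1.00 would mean the two tests rank the pupils almost
identically; values such as .58 leave considerable room for disagreement about
individual cases.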

A second line of evidence is to determine whether the group test prin-
cipally measures speed. Apparently, from the studies available, speed of
work is not the principal element causing high or low scores on most group
tests (5, 10). But speed tests may be undesirable, even when speed is a
minor factor. The Ohio State Psychological Examination was given to groups
twice, one form timed and the other untimed. The correlation with college
marks was .53 without time limit, .46 speeded (27). More recent forms of
the test, unspeeded, have even better validity.


Table 26. Relation of Reading Ability to Group Test IQ’s (9) 


Reading Ability of Pupils,                 Mean IQ on   Mean IQ on Haggerty
Compared to Those of Equal MA       N      Binet Test   Group Test            Difference

Very superior readers               36        94           114.5                +20.5
Superior readers                   224        92           109                  +17
Retarded readers                   100       107            99                   -8
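
The Difference column of Table 26 is the group-test mean minus the Binet mean
for each reading group. A short sketch recomputing it from the table's own
figures follows; the variable and function names are illustrative only.

```python
# (reading group, N, mean Binet IQ, mean Haggerty group-test IQ), from Table 26
TABLE_26 = [
    ("Very superior readers", 36, 94.0, 114.5),
    ("Superior readers", 224, 92.0, 109.0),
    ("Retarded readers", 100, 107.0, 99.0),
]


def group_minus_binet(rows):
    """Return {reading group: group-test mean IQ minus Binet mean IQ}."""
    return {name: group_iq - binet_iq for name, _n, binet_iq, group_iq in rows}


for name, diff in group_minus_binet(TABLE_26).items():
    print(f"{name}: {diff:+.1f}")
```

The sign of the difference follows reading skill: the group test credits good
readers with more ability, and poor readers with less, than the Binet does.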


The correlation of mental tests with tests of achievement is high. With
measures of reading, total score on CMM correlates .70-.78 for children, and
language score correlates .80-.84. The non-language part correlates .36 to .56

[Figure 41: bar chart. Horizontal axis: AGCT Standard Score, 60 to 150. Each
bar marks the 10th, 25th, 50th (median), 75th, and 90th percentiles for one
civilian occupation. Occupations, top to bottom: Accountant; Medical student;
Teacher; Lawyer; Bookkeeper, general; Stenographer; Reporter; Clerk, general;
Purchasing agent; Salesman; Telephone repairman; Artist; Musician,
instrumental; Toolmaker; Printer; Machinist; Policeman; Sales clerk;
Electrician; Machinist's helper; Welder, combination; Plumber; Carpenter,
general; Automobile repairman; Tractor driver; Painter, general; Truck driver,
heavy; Cook; Laborer; Barber; Miner; Farm worker; Men in general.]

Fig. 41. Army General Classification Test scores for various
occupational groups (after 23).


(24). Poor reading lowers group test performance (see Table 26). Correla-
tions between intelligence tests and standardized tests of school knowledge
include .82 (Detroit Alpha), .75-.85 (National Intelligence), and .78 (Kuhl-
mann-Anderson) (19, pp. 631-632; 1). Apparently, group tests reflect school
achievement (this is not a question of prediction, since both tests were ad-
ministered at the same time). Kelley estimates that for children what one
group verbal test measures is 95 percent the same as what another such test
measures, is 92 percent the same as what typical reading tests measure, and is
90 percent the same as what is measured by the typical informational school
achievement test (15, p. 208). This overlap is probably less for out-of-school
adults. The implication of this finding is that for pupils with insufficient
school experience, whether due to poor schools, frequent absence, emotional
interference with learning, or other factors, the group test is far from an
indication of capacity. But this does not interfere with attempts to use group
test scores to predict the child’s probable success in school. Group mental
tests are not essential to make predictions if a child has taken standard
achievement tests. But when pupils from different grade schools are pulled
together in a junior high school and in other such situations, intelligence
tests are important.

Intelligence has direct significance for vocational guidance, as indicated
both by the reports summarized in Table 18 and in numerous studies of in-
telligence levels of various types of workers. Tests of Army draftees show
marked relations between test score and occupation (23). Part of the findings
obtained by the Personnel Research Section of the Adjutant General’s Office
are reproduced in Figure 41. It is important to note that there is great over-
lapping; the best men in one group are far superior to the least intelligent
men in groups higher on the scale. These findings do not show the relation-
ship between test intelligence and ability within the occupation. In fact, in
some routine occupations high-scoring workers are less satisfactory on the
job than workers with low intelligence.

The pragmatic question, Do group intelligence tests give useful predic-
tions? can be answered easily. They do work well, in industry, in schools,
colleges, and in military classification. They serve better in some places than
others, and the particular test most suitable for one prediction may not be
best for another. But by and large, the empirical validity of group tests
thoroughly justifies their wide application.

30. On typical group tests, children from working-class homes, on the average, 
do less well than children from middle-class homes (10). Give as many ex- 
planations for this fact as you can. 

31. A study by Eells compares the performance on typical group mental tests of
children from middle-class homes and from lower-class homes. Data for
several test items are reproduced in Figure 42. Try to suggest what mental
processes would cause pupils to make the errors they did.

32. What differences between lower-class children and middle-class children 
appear in the data of Figure 42? 

33. According to Army Alpha results, white adults in certain Southern states
earned, on the average, lower scores than typical adults from certain Northern
and Western states. Give as many possible explanations of this fact as you can.

34. Outline an intelligence testing program for use in a public elementary school,
considering such questions as: How often should every child be tested?
Should all testings be based on the same series? What provision should be
made to allow for the fact that group test IQ’s may be seriously wrong for
some pupils?


A Form for Examining Tests 

There is little purpose in attempting to describe here the details of specific
tests, since the interests of readers will vary widely. But every reader should




[Picture item; drawings not reproduced]
Responses of “middles”     9%   [illegible]    2%   20%    3% omit
Responses of “lowers”     27%   [illegible]    4%   19%    3% omit
From Otis Alpha Non-Verbal Test. Directions given orally: “[Find] the three
things that are alike and draw a line through the one that is not like these
three.”

[Picture item; drawings not reproduced]
Responses of “middles”    62%    0%    7%   28%    3% omit
Responses of “lowers”     58%    1%   14%   24%    3% omit
From Otis Alpha Non-Verbal Test. Directions same as above.
(This and the above reprinted by special permission from Otis Alpha Non-Verbal
Test, Form A. Copyright by World Book Company.)

A fire always has          wood   coal   gas   warmth   furnace
Responses of “middles”     18%     9%    1%     69%       2%      0% omit
Responses of “lowers”      22%    22%    7%     38%       7%      4% omit

Which month comes
just before March?         April   September   May         February   August
Responses of “middles”     12%      0%         [missing]    85%        0%      2% omit
Responses of “lowers”      15%      5%          7%          62%        7%      4% omit
(This and the above from the Henmon-Nelson Tests of Mental Ability, Elementary,
Houghton Mifflin Company.)

“Find the word in the last of the list which tells what kind of a thing the
first word in the list is.”
salmon                     meat   water   swim   fish        food
Responses of “middles”     12%     0%      1%    [missing]   10%     2% omit
Responses of “lowers”      15%     3%     10%    47%         17%     8% omit
(From the Kuhlmann-Anderson Test, by permission of Educational Test Bureau.)

“Mark under the R for rights, and under the L for lefts.”
[Three pictures; drawings not reproduced]
Responses of “middles”     94% L   83% R   95% R
Responses of “lowers”      83% L   81% R   84% R
(From the California Test of Mental Maturity, Elementary Series. Published with
permission of the California Test Bureau, Los Angeles, California.)

Fig. 42. Responses of nine- and ten-year-olds from two social-status classes on
intelligence test items (10). Items selected for this figure are not entirely represen-
tative of the findings on the basis of all test items included in the research.

Table 27. A Form for Analyzing Tests


Title:
Authors:
Publisher:
Date of publication:
General purpose of test:
Group to which applicable:
Cost, booklet:
      answer sheet:
Time required:
Forms:

Mental functions represented in each score reported (from logical analysis):
What author intended to measure; basis for selecting items:
Basis for scoring (correction formulas, empirical weighting, etc., if any):
Evidence of empirical validity (criterion, number and type of cases, results):
Comments regarding validity for particular purposes:
Evidence of equivalence of scores (method of determining, number and type of cases, results):
Evidence of stability of scores (method of determining, number and type of cases, results):
Norms (type of norms reported, sample on which determined):
Comments regarding adequacy of reliability and norms for particular purposes:

Practical features of importance:
Comments of reviewers:
General evaluation:
References:


acquaint himself with particular group tests. To assist in making this study, 
the form presented in Table 27 is suggested. This outline is intended to point 
out every feature worth noting, and it may be that some items can be omitted 
when reviewing tests with a particular purpose in mind. The form may be
used for tests of any type. Most of the information called for by this form will 
appear in the test manual, if the manual is an adequate one. The manual, 
however, presents the test in a favorable light. A careful evaluation of all 
claims made is essential to bring to light the limitations of the test. Sources 
such as Buros will also be helpful, and both research and experience with 
the test contribute to evaluation. 
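
For readers who keep their test reviews in machine-readable form, the outline
in Table 27 maps naturally onto a simple record. The following Python sketch is
one possible layout and is not part of the original form: the class name
`AnalysisForm`, the field names, and the helper `missing_fields` are all
illustrative assumptions. The sample values are taken from the entries given
for the ACE test.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class AnalysisForm:
    """One filled-in copy of the form in Table 27; field names are illustrative."""
    title: str = ""
    authors: str = ""
    publisher: str = ""
    date_of_publication: str = ""
    general_purpose: str = ""
    applicable_group: str = ""
    cost: str = ""
    time_required: str = ""
    forms: str = ""
    mental_functions: str = ""
    intended_measure_and_item_basis: str = ""
    scoring_basis: str = ""
    empirical_validity: str = ""
    validity_comments: str = ""
    equivalence_evidence: str = ""
    stability_evidence: str = ""
    norms: str = ""
    reliability_and_norms_comments: str = ""
    practical_features: str = ""
    reviewer_comments: List[str] = field(default_factory=list)
    general_evaluation: str = ""
    references: List[str] = field(default_factory=list)


def missing_fields(form: AnalysisForm) -> List[str]:
    """Names of the form entries that are still blank."""
    return [f.name for f in fields(form) if not getattr(form, f.name)]


ace = AnalysisForm(
    title="American Council on Education Psychological Examination for College Freshmen",
    authors="L. L. and T. G. Thurstone",
    publisher="Educational Testing Service",
    time_required="1 hour, plus administration time",
)
print(len(missing_fields(ace)), "entries still to be filled in")
```

Listing the blank entries is a convenient check that no heading of the form
has been skipped when a review is prepared from the manual alone.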

To demonstrate what a complete test analysis is like, the author has made

Arithmetic.

1. How many pencils can you buy for 50 cents at the rate of 2 for 5 cents?

(a) 10 (b) 20 (c) 25 (d) 100 (e) 125

Completion. Select the first letter of the word which fits the following definition.

The thin cutting part of an instrument, as of a knife or sword.

A B D H W

Figure Analogies. Decide what rule is used to change Figure A to Figure B, apply this
rule to Figure C, and select the resulting figure from the five at the right.

A B C     1 2 3 4 5
[figures not reproduced]

Same-Opposite. Select the word that is the same or the opposite of the word at left.
ancient (1) dry (2) long (3) happy (4) old

Number Series. Select the next number for the series at left.

10 8 11 9 12 10      9   10   11   12   13
                    (a)  (b)  (c)  (d)  (e)

Verbal Analogies.

cloth-dye house- (1) shade (2) paint (3) brush (4) door


Fig. 43. Sample items from subtests of the American Council 
Psychological Examination. 


Table 28. An Analysis of the ACE Test


Title: American Council on Education Psychological Examination for College Freshmen
Authors: L. L. and T. G. Thurstone
Publisher: Educational Testing Service
Date of publication: see "Forms"
General purpose of test: to measure aptitude of college freshmen
Group to which applicable: college students
Cost, booklet: 8¢
      answer sheet: 2¢
Time required: 1 hour, plus administration time
Forms: new form each August. (Another form for high school. SRA Verbal Form and
Thurstone Test of Mental Alertness are similar, shorter.)

Mental functions represented in each score reported (from logical analysis):

L, linguistic ability: Includes vocabulary knowledge and ability to reason with words; speed of reading
may be important.

Q, quantitative ability: Includes skill and speed in arithmetic problems, reasoning with numbers, and a
difficult non-verbal reasoning section.

Total score: Influenced both by past school learning and ability to attack unfamiliar problems. Speed
is important.

What authors intended to measure; basis for selecting items:

Scholastic aptitude, with especial reference to college curricula.

Items chosen by factor analysis and preliminary trial, those having suitable difficulty and discrimination
being retained.

Basis for scoring (correction formulas, empirical weighting, etc.):

Score is number right.

Evidence of empirical validity (criterion, number and type of cases, results):

Numerous studies in various colleges show correlations with marks around .50.

But some studies fail to support belief that Q best predicts engineering and science courses, or that L
best predicts work in more verbal subjects (17).

Comparison with Rorschach for 80 college girls shows that test reflects personality factors. Girls with L
higher than Q have greater spontaneity; those with higher Q have tendency toward precision, con-
trol (21).

Correlation of ACE with Binet Form L is .60, for 112 superior college freshmen (2).

Comments regarding validity for particular purposes:

Estimates probable college success as well as any mental test; Ohio State test equally suitable. Owing
to greater difficulty, ACE is superior to general-purpose tests with college groups. Owing to lack of cor-
rection formula, it tends to reward hasty workers. Slow workers earn low scores, no matter how capable
they would be with longer time. Low scores should be considered as indications that student will do poorly
unless given special help (perhaps remedial work in reading or arithmetic); low scores are not necessarily a
sign of basic inaptitude. ACE probably not suitable for non-academic prediction. Further research on
significance of Q and L scores is required.

Evidence of equivalence of scores (method of determining, number and type of cases, results):

Correlation between forms not published.

Evidence of stability of scores (method of determining, number and type of cases, results):

r = .83, for 286 college women tested as freshmen and seniors.

Marked gains in score during freshman year (13).

Norms (type of norms reported, sample on which determined):

Each summer, national percentiles and means for all cooperating colleges for previous year's test are
published, without identifying colleges by name. The 1946 norms were based on 86,212 cases in 317
colleges. Preliminary norms, estimated from results with previous form, are published when each new form
is released.

Comments regarding adequacy of reliability and norms for particular purposes:

Reliability is quite satisfactory for college groups. Few other tests are equally reliable for superior
students. One-year lag in publication of norms causes no serious difficulty. Test should be interpreted with
reference to local norms, since demands of different colleges vary. Separate norms by divisions of the
college or university should be collected.

Practical features of importance:

Adapted to machine scoring, which is essential for rapid processing of freshmen. Requires painstaking
administration, owing to numerous timed subtests. New forms each year eliminate much practice effect,
leaking of questions in advance.

Comments of reviewers:

"This is perhaps the test that one is likely to recommend to anyone who is looking for a 'good' intelli-
gence test to give to a group of college freshmen. . . . Division of the score into a Q and an L score
needs considerable study before its full significance is apparent" (Commins, 8).

"Constructed with high quality of workmanship. . . . Avoids the overweighting of one fundamental
resource (verbal ability) to the neglect of others that may be relatively more important in specialized
curricula and courses. . . . The 'quantitative' part is a conglomerate factorially. . . . It is the reviewer's
belief that the factors measured, except for the numerical, should be assessed by means of power tests"
(Guilford, 8).

General evaluation:

An especially useful test for its purpose. Must not be treated as an all-purpose mental test. Persons
with low scores must be studied carefully.


such a report on the American Council Psychological Examination. Figure
43 illustrates the types of items composing the subtests of the ACE test.

GROUP NON-VERBAL TESTS

Non-verbal group tests have not been widely used in this country. It is
rare that one has occasion to test an entire group of persons who are linguis-
tically handicapped. Such situations do arise, as in military processing of il-
literates or in special schools for “non-academic” pupils. It is possible that
non-verbal tests deserve wider application in industry than they have re-
ceived.

The most widely known non-verbal test is Army Beta, which was used in
World War I for those cases in which the verbal test, Alpha, was considered
inappropriate. Directions were in pantomime, which removed the effect of
differences in language ability from the score. In World War II, the Group


Table 29. Representative Group Non-Verbal Tests

Columns: Test and Publisher; Age Level; Typical Tasks, Special Features;
Comments of Reviewers (a)

CMM Non-Language (Calif. Test Bureau)
    See Table 24.

Cattell Culture-Free Test (Psych. Corp.)
    Age level: MA 12 to superior adult.
    Tasks, features: Classification, figure series, etc. Nearly unspeeded.
        Supposed free from school influence. Range covers superior groups.
    Reviewers: "Relatively uninfluenced by cultural factors, highly saturated
        with general intelligence" (Shipley, 8). "Little evidence to show test
        free from over-all cultural influence" (Wechsler, 8).

Chicago Non-Verbal (Psych. Corp.)
    Age level: Age 6 to adult.
    Tasks, features: Digit symbol, block counting, etc. Can be used with
        pantomime directions.
    Reviewers: "Not reliable enough to use in comparing a child with his
        classmates. . . . Norms tentative" (Bernreuter, 7, p. 211). "Wider
        range of mental functions [than competing tests]" (Pignatelli, 7,
        p. 212).

Pintner General Ability Tests; Non-Language (World Book)
    Age level: Grades 4-9.
    Tasks, features: Paper folding, reasoning with forms, etc. Pantomime
        directions possible.
    Reviewers: "Well constructed. . . . Best test in its field" (Whitmer, 8).

Revised Beta (Psych. Corp.)
    Age level: Age 16-60.
    Tasks, features: Maze, digit symbol, picture completion, etc.
    Reviewers: "Most of the subtests measure the same things that verbal tests
        do except that the subject is given pictures and symbols, not words"
        (Wechsler, 7, p. 242).

SRA Non-Verbal Form (Sci. Res. Assoc.)
    Age level: Adolescents and adults.
    Tasks, features: Picture classification (see Fig. 10).

(a) The reader should refer to the source cited for a full statement of the reviewer’s opinion.


Target Test filled the same function. There have been a variety of civilian
non-verbal tests, each offering new item forms, superior standardization, or
other minor changes. Most recently, non-verbal materials have been used as
sections in diagnostic or semi-diagnostic batteries. Thus, the CMM Non-
Language section is a largely non-verbal test and can be given separately, but
it is more often given as part of a general mental test. Some other tests having
non-verbal sections are listed in the next chapter.





Development of non-verbal tests has been hampered by the difficulty of
eliminating language completely, since directions are difficult to pantomime.
Another difficulty is that most non-verbal items do not call for very complex
mental processes. With the exception of “spatial” items which measure ability to
perceive relations among forms, most non-verbal items are too easy to meas-
ure superior adolescents and adults. Some recent tests have demonstrated
ingenious solutions to this problem. But if “higher intelligence” involves abil-
ity to do abstract thinking, it is hard to conceive of an adequate high-level
test that does not involve vocabulary and concepts.

Table 29 lists several current non-verbal tests. These are primarily used for
clinical purposes. If a pupil is a poor reader, he will do badly on most verbal
tests; if he does much better on a non-verbal test, he is a good prospect for
remedial reading, since his problem seems to be a verbal handicap rather
than lack of mentality. Delinquents who have done badly in school may show
significantly better performance on non-verbal tests than on verbal tests;
such a finding dramatically changes the treatment one would prescribe. Like
performance tests, non-verbal tests are important in intercultural compari-
sons.

On the whole, the empirical validity of non-verbal tests has been only mod-
erate. They do not correlate highly with verbal tests or with school success.
This is to be expected, but a test which is not known to correlate with some 
other variable has little practical value. Criteria of success outside of school 
might correlate with non-verbal tests, but evidence has not been presented 
to establish this as a fact. 

35. In a Veteran’s Counseling Center, adults with varying educational back- 
grounds and vocational goals must be given advisement. Prepare what you 
consider a minimum list of intelligence tests needed to cope with all non-
psychiatric cases.

36. Prepare a minimum list of intelligence tests needed by a school psychologist 
who is expected to diagnose any pupil, age 6 to 16, whose behavior or school- 
work is considered unsatisfactory. 

AN EVALUATION OF INTELLIGENCE TESTING 

Now that the reader is acquainted with the range of approaches used in 
studying general intelligence, it is appropriate to ask how adequate these 
mental tests are. Tests of general intelligence have been found useful in all 
types of psychological practice and research, but for some problems our 
present armamentarium is far from ideal. 

One great gap is in tests of high-level general mental ability. The Wechs-
ler test and the Binet tests have low ceilings, and are undependable in dis-
tinguishing between the superior and the very superior within, for example,
college-graduate populations. The way to distinguish between the high
grades of ability is to introduce more difficult items, but this is usually ac-
complished, as in the ACE test, by adding items depending upon advanced
education rather than items calling for superior adaptive ability. It is hard
to invent a test item which only one person in a thousand can pass, unless it
calls for experiences that only highly educated people have. There is need
for high-level tests in selecting graduate students and in choosing among
employees in such superior groups as junior executives and engineers.

A second major limitation is the scholastic emphasis of present tests. Ef- 
forts to test types of thinking characterized by logic, accuracy, and knowl- 
edge have been very successful. There has been little success in testing the 
types of thinking which distinguish the novelist from the engineer or the 
clinical case worker from the museum curator. This is probably of minor im-
portance at younger ages where all abilities seem to be highly correlated, al-
though even here we must occasionally overlook a child of high potential
along lines not stressed in our tests. In vocational guidance, we have little 
basis for judging which man will be most insightful, most creative, and most 
original, except to assume that these traits are correlated with high vocab- 
ulary or ability to solve analogies. 

We also need better non-verbal tests. Non-verbal tests are much the same,
and few of them require above-average adult thinking. If one can have high
mental ability, all tasks considered, without having exceptional verbal ability
(as Thurstone’s studies discussed in the next chapter imply), many people
have been rated unjustly by our present tests. Mental tests do not have to be
verbal. The stress in our culture on verbal education as the key to advance-
ment has set the pattern for our tests. But the opposite situation is found in
Japan. The written Japanese language is so difficult that it requires most of
six years of schooling to become literate. Most Japanese, before the war, left
school at the sixth or eighth grade, so that only the privileged elite developed
their verbal abilities fully. To test school children and prospective employees,
Japanese psychologists were forced to make nearly all their group tests non-
verbal, since differences in verbal ability primarily reflected education. Amer-
ican testers have tended to assume that verbal differences should be stressed
in tests, a statement which is unarguable only so long as the performances
to be predicted are verbal in nature.

As against these persistent stumbling blocks, attention may be drawn to
the steady progress that has been made in efficient test design, to the great
growth in awareness of limitations of tests which has improved interpreta-
tion, and to the recent emergence of tests which call the clinician’s attention
to how the subject thinks, rather than how many points he can earn. More
effective diagnosis of the facets of mental ability is promised by emerging re-
search reported in the next chapter. We have come a long way from Wissler’s
trying to correlate college success with speed of canceling a’s, and from
Binet’s wondering whether phrenologists had anything to contribute to the
identification of mentally inferior children. The user of today’s tests can be
sure that the people they select for him will be a clearly superior group.
The fact that the tests sometimes do injustice to an individual is a problem
to which, in a democracy seeking to give everyone the opportunity he needs
for development, we must give conscientious attention.

CHAPTER 9


Factor Analysis: The Sorting of Abilities 


In previous chapters, attention has been given to measures of general mental 
development. Within those tests, however, there were many attempts to 
break down this all-over rating to describe the particular tasks at which the 
subject is best and poorest. General intelligence by no means tells the whole 
story about a person’s endowment. He has certain knacks or talents which set 
him apart from those of equal general ability; these specific aptitudes are of 
great importance in guidance and personnel selection. 

Aptitude tests first appeared in hit-or-miss fashion, as a psychologist
interested in a particular performance such as music, typewriting, or assembly
of small parts developed a device which he hoped would fit his needs. Often
these tests failed to predict the task under consideration and, though often
labeled as “aptitude” or “prognostic” tests, were essentially general
intelligence tests. Recent studies are beginning to provide a more systematic
method of studying aptitudes. Instead of cataloging all the specific tests that
have been tried, it will be profitable to consider the major independent
abilities in broader groupings.

Factor analysis is the tool used in sorting out the abilities of man. It helps
separate special aptitudes from general intelligence, and permits grouping
of tests which overlap. Understanding factorial thinking is essential, as it
underlies a large amount of current test construction. Because factor analysis
is a difficult topic, only the simpler aspects of the method can be presented
here.


THEORY OF FACTORIAL ANALYSIS 
Interpreting Sets of Correlations 

Conclusions about what a test measures are readily drawn from its
correlations with other tests. A table of intercorrelations permits us to study the
relations among a set of tests. Any correlation indicates that tests possess a
common element. The common element that causes tests to be correlated is a
factor. Binet applied such reasoning when he decided that his tests, all
having a substantial relation to each other, must all be influenced by the same
common factor, general intelligence. Gaw used the same reasoning to
conclude that different performance tests measured the same thing (see p. 165).
Wissler, who found that his tests had very small intercorrelations, concluded
correctly that they had very little in common and therefore represented
different abilities (see p. 102). The factor concept can be illustrated by
means of a series of problems.

Table 30 presents correlations among three Navy classification tests. One
should arrive at two conclusions from these data: (1) Because the
correlations are generally positive, the tests must be affected by some common
characteristic. (2) Tests A and B appear to have more in common than either has
in common with test C. The reasonableness of such a result is clear when we
find that A is the General Classification test, B the Reading test, and C the
Arithmetic Reasoning test. Probably the common element in all three tests is
general ability, schooling, and past learning. But the two verbal tests would
certainly have more in common than either would have in common with a
test strongly influenced by aptitude or training in mathematics. A formal
factor analysis goes beyond this inspection and estimates how much each
test is influenced by the common factors.
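
As a rough illustration of what such an estimate can look like, the sketch
below assumes (a simplification not made in the text) that one common factor
alone accounts for the three correlations among tests A, B, and C. Under that
assumption each loading follows from the classical triad relation, for example
loading of A = sqrt(r_AB x r_AC / r_BC). The function name
single_factor_loadings is hypothetical.

```python
import math


def single_factor_loadings(r_ab, r_ac, r_bc):
    """Loadings of three tests on ONE assumed common factor (triad relation)."""
    a = math.sqrt(r_ab * r_ac / r_bc)
    b = math.sqrt(r_ab * r_bc / r_ac)
    c = math.sqrt(r_ac * r_bc / r_ab)
    return a, b, c


# Correlations among the three Navy tests: A-B = .81, A-C = .69, B-C = .69
a, b, c = single_factor_loadings(0.81, 0.69, 0.69)
print(round(a, 2), round(b, 2), round(c, 2))  # 0.9 0.9 0.77
```

These figures hold only under the single-factor assumption; an analysis of a
larger battery, such as the one reported in Table 34, can assign different
weights to the same tests.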


Table 30. Intercorrelations of Three Tests for Navy Recruits (3) (a)

        A       B       C
A               .81     .69
B                       .69
C

(a) To simplify tables, each correlation is presented only once. The
correlation of A with B (or B with A) is .81. Symmetrical entries could
be made below the diagonal if desired.


Table 31. Intercorrelations of Four Attitude Tests Given to Young Adults (14)

                                                          Economic
                             Family   Law    Education   Conservatism
Attitude Toward Family                .39    .30         .08
Attitude Toward Law
Attitude Toward Education
Economic Conservatism

A different factor description is needed for the attitude tests whose
intercorrelations appear in Table 31. This time there is no substantial common
element in all four tests. “Economic conservatism” seems to be nearly
independent of the other three variables. “Family,” “Law,” and “Education”
scores have positive intercorrelations and may be said to reflect a common
factor. Frequently it is desirable to guess at the nature of the underlying
factor and to name it. The correlations suggest that those who are critical of
the family are also critical of the legal system and of educational institutions.
Those who favor one of these tend to favor the others. Therefore, a proper
name for the common factor might be “social conservatism,” or “acceptance
of present social institutions.” In Table 31 the correlations are lower than in
Table 30. The relatively low intercorrelations show that in addition to the
common factor each of the first three tests includes some trait not common to
the other tests.


Table 32. Intercorrelations of Six Navy Classification Tests for Recruits (3)

                         General                   Arithmetic  Mechanical  Electrical  Mechanical
                         Classification  Reading   Reasoning   Aptitude    Knowledge   Knowledge
General Classification                   .81       .69         .60         .53         .49
Reading                                            .69         .56         .51         .46
Arithmetic Reasoning                                           .61         .47         .41
Mechanical Aptitude                                                        .53         .55
Electrical Knowledge                                                                   .78
Mechanical Knowledge


1. Table 32 presents correlations between tests of the Navy classification battery. 
Does there appear to be a single common factor among all these tests? If so, 
what might be its psychological nature? 

2. Which pairs of tests seem to have the greatest common element, or overlap? 

FACTOR THEORIES OF INTELLIGENCE 
Factor analysis has clarified theoretical and practical problems. It helps to
reduce duplication in testing programs where tests with different names
actually measure the same trait. It reduces a conglomerate of psychological
tests to ordered families. In one famous study, Thurstone found that most of
the individual differences affecting scores on 56 mental tests could be
expressed in terms of only 9 factors (8). Perhaps the greatest contribution of
factor analysis has been to bring system to the study of the nature of
intelligence.

Intercorrelations permit us to identify three types of elements in tests:
general factors, group factors, and unique factors. A general factor is a factor
present in all the tests in a given set. A group factor is a factor present in
several, but not all, of the tests under consideration. A unique factor is a
characteristic which influences score on only one of the tests (Fig. 44).
General factors are present for the tests in Tables 30 and 32, but in Table 31 the
general factor is negligible. A group factor is found among three tests in
Table 31. Every test is probably influenced by some unique factor, but the
Economic Conservatism score in Table 31 most clearly illustrates a test with
a large specific content. In recent studies of the organization of abilities, the
central question has been to identify the general, group, and unique factors
in ability tests.
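
The three definitions can be restated as a small sketch. It assumes we already
know which tests a factor influences; the function name classify_factor and the
listing of tests per factor are hypothetical conveniences, not part of any
factoring procedure described here.

```python
def classify_factor(tests_loaded, all_tests):
    """Label a factor by how many tests of the set it influences."""
    loaded = set(tests_loaded) & set(all_tests)
    if not loaded:
        return "absent"
    if len(loaded) == 1:
        return "unique"    # influences only one test
    if loaded == set(all_tests):
        return "general"   # present in every test of the set
    return "group"         # present in several, but not all


tests = ["Family", "Law", "Education", "Economic Conservatism"]
# Pattern suggested in the discussion of Table 31
print(classify_factor(["Family", "Law", "Education"], tests))  # group
print(classify_factor(["Economic Conservatism"], tests))       # unique
```

Note that the label depends on the set of tests considered: the same factor
would be called general if Economic Conservatism were dropped from the set.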

In the days of “faculty psychology” men were described as possessing good
or poor memory, reasoning, imagination, and so on. These powers were
thought of as largely independent of each other, but of course tests of the
same function were expected to have a great deal in common. This was a

[Figure 44. Each test is drawn as a diamond shape; unshaded areas represent
unique factors, shaded areas common factors.]

One test. Suppose we have just one test, which we will represent by a diamond
shape. Then we cannot assume the presence of common factors with other tests.
We will use an unshaded area to represent unique factors.

Two tests. Now suppose a second test is added. Then the following patterns are
possible. The shaded area represents a general factor, common to both tests.
  All factors may be unique.
  There may be both general and unique factors.
  The tests may measure the general factor and nothing else.

Three tests. The possible patterns are even more varied if we add a third test.
This time a general factor, present in all three tests, is shown by a heavily
shaded area, and group factors are shown by lightly shaded areas.
  All factors may be unique.
  There may be group and unique factors.
  There may be general, group, and unique factors.
  There may be only general and unique factors.

With three tests, it is also possible to have general and group, but no unique
factors; group factors alone; or a general factor alone. These are unlikely.

Fig. 44. Possible factorial relations among tests.


group-factor theory, a theory that abilities fall into groups which are largely
distinct from each other. Studies like those of Wissler tended to break down
this theory by showing that even tests within the same group did not measure
the same ability. If different memory tests measure entirely different things,
we must retreat to a theory of unique abilities. The problem then would be to
catalog, as independent abilities, such traits as memory for digits, memory
for poetry, memory for designs, and so on. There would, of course, be an
overwhelmingly long list of such subdivisions. Binet led a march away from
the unique and group theories of mental organization with his search for a
general, pervading “intelligence.” As we have seen, this was justified by his
evidence that a large number of tests dealing with significant abilities had a
considerable common element. While the concept of intelligence as a general
characteristic found in most mental performances has been of great
usefulness, it has never been entirely satisfactory. Binet’s test and every other have
shown that mental development is not uniform, that no test is a pure measure
of general intelligence alone. Recent research has emphasized a search for
these other elements which affect test performance, in addition to Binet’s
general factor.

The theory of general intelligence has been most effectively stated by
Spearman. He conceived of abilities as divisible into g (the general factor)
and many unique factors found only in single tasks. Many investigators,
especially in England, have used factor analysis in order to discover the nature of
g by identifying tests saturated with it. According to Spearman (16, p. 411):
“The g proved to be a factor which enters into the measurements of ability
of all kinds, and which is constant for any individual, although varying
greatly for different individuals. It showed itself to be involved invariably
and exclusively in all operations of eductive nature, whatever might be the
class of relation or the sort of fundaments [content] at issue. It was found to
be equally concerned with each of two general dimensions of ability,
Clearness and Speed.”

Spearman’s theory has been modified to account for the fact that some sets
of mental tasks have common elements not present in other tasks. The
Holzinger-Spearman method, known as “bi-factor” analysis, seeks to divide a set
of tests into their general, group, and unique components (10).

A competing theory holds that it is most efficient to describe intelligence
as broken into psychologically separate clusters of aptitudes, which can be
separately measured. This theory does not require the clusters to be entirely
independent. Although the theory superficially resembles faculty psychology,
it accepts none of the faculty psychologists’ ideas as to the nature and source
of abilities. This group-factor emphasis is represented by Thurstone’s
“multiple-factor” techniques (19). The relations between the general and group
theories of ability will be clarified in the sections to follow.


3. In different methods of interpreting performance, the Wechsler test yields
information about a general factor, about group factors, or about unique factors
present in only one subtest. Demonstrate the truth of this statement.

4. Confidence may be manifested in a variety of situations: making a speech to a 
women’s club, taking one’s car apart to repair it, piloting a fighter plane, 
going to a show instead of cramming for a test, and so on. Give three alternative 
explanations of the nature of confidence: one in which it is considered as a 
general factor, one in which it is divided into group factors, and one in which it 
is considered as a number of highly specific factors. Which theory do you think 
is most adequate? 

5. If confidence is to be measured for selecting future fighter pilots, how would 
a psychologist test confidence for this purpose if he believed it to be a general 
trait? How would he proceed if he considered confidence to be specific to a 
particular situation?

Patterns of Factors

Before studying the evidence from factor analysis, it will be helpful to
note two of the kinds of factor descriptions theoretically possible.
Hypothetical correlations are used in the following illustration.

Table 33. Factorial Descriptions Appropriate for a Set of Five Tests

Correlations

Test    1      2      3      4      5
1              .40    .50    .35    .20
2                     .30    .25    .10
3                            .35    .20
4                                   .50
5

A. Bi-Factor Structure

                Factors (a)
Test    a      b      c      Unique
1       xx     x             U1
2       x      x             U2
3       xx                   U3
4       x             x      U4
5       x             xx     U5

B. Multiple-Factor Structure

                Factors (a)
Test    a'     b'     c'     Unique
1       xx            x      U1
2       x             x      U2
3       xx     x             U3
4              xx            U4
5              x             U5

(a) Entries in the factor description indicate which tests are influenced by
each factor; x indicates a moderate influence, xx indicates a marked influence.
The complete factor pattern of any test is determined from the correlations by
complex statistical procedures not discussed in this book.

Table 33 gives intercorrelations for five tests and shows two ways of
describing the tests factorially. The first pattern is a bi-factor description.
According to Table 33A, Test 1 is made up of a heavy loading of the general
factor (a) plus some of a group factor (b), and its unique factor (U1).
Mathematically it is equally sound to divide the tests according to the factor plan
shown in Table 33B. This is a multiple-factor design and groups the tests in
such a way that no general factor is reported. This time, Test 1 is found to
contain two group factors (a') and (c'), and its unique factor (U1).

The investigator in any study chooses whether he prefers to use a bi-factor
or a multiple-factor description (or one of the several alternative methods
such as those of Kelley and Hotelling). The choice is determined in part by
the data, and in part by the description that would be convenient for a
particular purpose. The problem is not discussed fully here, but two basic
principles are:

1. A general factor can always be obtained, if desired, when all the
correlations in a matrix are positive.

2. One type of factor solution can be converted into another by
appropriate algebraic methods (9).

This means that one can sometimes prove the absence of a general factor.
If the correlations permit a general factor, we may isolate it or not as we
choose. For sets of positive intercorrelations, neither factor description is
more “right” than the other. The difference between them is one of emphasis
and convenience.
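
The second principle can be illustrated in its simplest form, a rigid rotation
of two uncorrelated factors. This is only a sketch of one kind of conversion,
not the methods cited in (9); the loadings are hypothetical and the function
names are illustrative. It shows that two different-looking factor
descriptions reproduce exactly the same correlations, which is why neither can
be called more “right.”

```python
import math


def reproduced_correlation(loadings, i, j):
    """Correlation of tests i and j implied by uncorrelated common factors."""
    return sum(x * y for x, y in zip(loadings[i], loadings[j]))


def rotate(loadings, degrees):
    """Rotate a two-factor loading table through the given angle."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [[a * c + b * s, -a * s + b * c] for a, b in loadings]


# Hypothetical loadings of three tests on two uncorrelated factors
original = [[0.8, 0.1], [0.6, 0.3], [0.2, 0.7]]
rotated = rotate(original, 30)

for i, j in [(0, 1), (0, 2), (1, 2)]:
    before = reproduced_correlation(original, i, j)
    after = reproduced_correlation(rotated, i, j)
    print(i, j, round(before, 3), round(after, 3))
```

Each pair of tests prints the same value before and after rotation, although
the individual loadings have changed.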


How Factor Analysis Groups Tests 

A simple example of how factor analysis permits regrouping of tests will 
clarify its application. Peterson factored the Navy classification scores from 
the General Classification, Reading, Arithmetic, Mechanical Knowledge, and 
Mechanical Aptitude tests (13). These were tests designed by psychologists 
to measure each recruit on the types of abilities most important in placing 
him in a suitable berth. Because classification interviewers wished diagnosis 
of strengths and weaknesses, tests were broken into many subscores. There 
were twelve scores which had been used separately or considered for such 
use. The results of a multiple-factor analysis are found in Table 34. 


Table 34. Factor Analysis of Navy Classification Test Scores (13)

                                                        Factor Loading (a)
Test                    Subdivision                 I      II     III    Unique
Reading                 Reading                     .70    0      0      x
General Classification  Opposites                   .76    0      0      x
                        Analogies                   .73    0      0      x
                        Series Completion           .68    0      x      x
Arithmetic Reasoning    Arithmetic Reasoning        .56    0      x      x
Mechanical Knowledge    Tool Relations              0      .69    0      x
                        Mechanical Information      x      .59    0      x
                        Electrical Comprehension    x      .67    0      x
                        Mechanical Comprehension    x      .64    0      x
Mechanical Aptitude     Block Counting              0      0      .61    .64
                        Mechanical Comprehension    0      x      .52    x
                        Surface Development         x      x      x      .65

(a) x indicates factor loading between .20 and .50; 0 represents negligible
loading, below .20.


The analysis shows only three common factors in the twelve tests. This of
itself does not prove that there are only three abilities involved; perhaps
each test makes a unique contribution, measuring an important independent
factor. But the unique-factor loadings are too small to warrant using any
tests to measure unique factors except Block Counting and Surface
Development. Peterson, then, showed that most of the measurement in the twelve
scores can be reduced to five scores which provide the same information
without duplication. Factor I, Factor II, Factor III, and two unique factors
are warranted. If a reliable test for each factor is designed, seven scores could
be eliminated from the record a classification interviewer need think about.
Peterson recommended that GCT and MK not be broken into subtests, since
the subtests within each measured nearly the same thing. The logic is
identical to that introduced in Chapter 7, when it was pointed out that subtests
having high correlations are rarely suitable for diagnosis.

[Figure 45. Three charts showing the percentage of each test attributable to
each factor: Numerical Operations, a nearly pure test (subtraction and division
exercises); Directional Plotting, a mixed test (reporting directions between
two points, when given their coordinates on a chart); Complex Coordination, a
highly complex test (job replica apparatus test).]

Fig. 45. Tests of different factorial purity. Factor loadings were determined
by analysis of a complete battery of AAF classification tests (7, pp. 828-831).
The most prominent factor in each test is shaded.

The reader may be puzzled by the factor weights in Table 34. The weights
for Block Counting, for example, add to over 1.00. It is necessary to explain
that the weights, as written, do not represent the percentage breakdown of
the tests. If the factors are uncorrelated, that breakdown is represented by
the equation:

$$1 = a^2 + b^2 + \cdots + U^2 + (1 - r_{11})$$

where $a$, $b$, etc. are the factor weights for the various common factors, $U$ is the
unique-factor weight, and $r_{11}$ is the coefficient of equivalence (reliability).
If the Reading test has a reliability of .85 (1), it breaks down as follows:*

 15%  unreliability, or error of measurement (1 - .85)
 49%  Factor I, probably scholastic ability (.70 squared)
 36%  unique factors (remainder; 100% minus 15% and 49%)
100%

The percentage of the reliable portion of the test that is accounted for by the
most prominent factor is called the saturation or purity of the test. For the
Reading test, 49 percent out of 85 percent (the reliable portion) represents
Factor I. There is 58 percent saturation (49 ÷ 85).
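
The arithmetic of this breakdown can be sketched in a few lines. The sketch
assumes uncorrelated factors, as the breakdown equation in the text requires;
the function name variance_breakdown and the layout of its result are
illustrative only.

```python
def variance_breakdown(common_loadings, reliability):
    """Split a test's variance into error, common-factor, and unique shares.

    Assumes the factors are uncorrelated.
    """
    common = [w ** 2 for w in common_loadings]
    error = 1.0 - reliability
    unique = reliability - sum(common)   # what is left of the reliable part
    if unique < 0:
        raise ValueError("common-factor variance exceeds the reliability")
    # Saturation: share of the reliable variance due to the largest factor
    saturation = max(common) / reliability if common else 0.0
    return {"error": error, "common": common,
            "unique": unique, "saturation": saturation}


# Reading test: loading .70 on Factor I, reliability .85
parts = variance_breakdown([0.70], 0.85)
print(round(parts["error"], 2), round(parts["common"][0], 2),
      round(parts["unique"], 2), round(parts["saturation"], 2))
# 0.15 0.49 0.36 0.58
```

Changing the reliability argument shows how the error and unique shares shift
while the common-factor share stays fixed.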

Many investigators are seeking to develop saturated tests of various mental
factors. It is contended that combinations of such pure tests permit more
accurate predictions than tests which are impure. Figure 45 shows the factorial
make-up of three tests tried by the Air Force. In this study a tremendous
number of factors were isolated mathematically. Some tests, such as
Numerical Operations, were found to be quite pure, whereas others combined a
great many different abilities.

6. In Table 33A, describe the factor composition of Tests 2, 3, 4, and 5. 

7. In Table 33B, describe the factor composition of Tests 2, 3, 4, and 5. 

8. Diagram the factor structure of Tables 30 and 31, according to either the
bi-factor or the multiple-factor pattern. Do not attempt to determine factor
weights.

9. Compute the percentage contribution of each factor to the Reading test,
assuming a reliability of .75 instead of .85.

10. What tests could be discarded, among the five in the Navy battery, without 
losing any factor now being reliably measured? 

A Typical Factor Analysis 

Factor analysis has been applied to aptitude, attitude, and personality
variables. In the following section, an elaborate study of tests of mechanical
ability is presented which both illustrates the results to be obtained by the
method and leads to helpful conclusions regarding mechanical aptitudes.

This study was made by Harrell in an attempt to clarify the nature of
“mechanical ability” (8). In 1940, when his work appeared, there were many
tests for selecting workers for mechanical jobs, based on many assumptions
about the nature of mechanical abilities. Were these tests, including
measures of motor speed, coordination, visual perception, mechanical knowledge,
and so on, measuring the same thing? If not, how many basically different
sorts of ability were represented in the tests? Harrell gave 31 tests to 91
cotton-mill machine fixers, and supplemented the tests with other data. He

* Peterson’s three factors have small intercorrelations (.19 to .31), but the formula for
uncorrelated factors may be used here, since the Reading test contains only one of the
factors.



Table 35. Factor Loadings of Mechanical Ability Tests (8)

                                                                                                  Factor Loadings
No. Variable                          Type of Performance                            I Perceptual II Verbal    III Youth    IV Agility   V Spatial
1   Youth                                                                           -.012        -.038        [illegible]  [illegible]   .068
2   Schooling                                                                       -.059         .456*       [illegible]  [illegible]   .079
3   Inexperience, Mech.                                                              .033        [illegible]  [illegible]  [illegible]   .082
4   Minn. Spatial, acc.               Puts cutouts in holes                          .392*       [illegible]  -.043        [illegible]  -.178
5   Minn. Spatial, speed              Puts cutouts in holes                          .251        [illegible]   .058        -.018         .275
6   Minn. Assembly                    Puts together mechanical devices               .415*       -.190        -.026        -.163         .383*
7   Minn. Paper Form Board            Visualizes cut-up forms (Fig. 50)              .228         .052         .142        -.176         .517*
8   Minn. Interest Blank                                                             .016         .149         .267        -.005        -.007
9   Minn. Routine assembly            Same as 6, without novelty                     .624*       -.026         .068        -.004         .154
10  Minn. Routine stripping           Same as 9, taking objects apart                .457*        .001        -.090         .091         .044
11  Whitman pinboard, preferred hand  Puts pins in holes                             .295        -.036         .193         .357*       -.125
12  Whitman pinboard, non-pref. hand  Puts pins in holes                            -.005        -.185        -.160         .380*        .042
13  Whitman pinboard, both hands      Puts pins in holes                            -.076        -.187        -.013         .505*        .137
14  Whitman pegboard                  Puts pegs in holes                             .384*        .059         .450*        .270        -.063
15  Whitman pegboard, disassembly     Removes nuts from bolts                        .123        -.006        -.135         .426*        .072
16  Whitman pegboard, assembly        Puts nuts on bolts                             .319         .007        -.055         .378*       -.130
17  Whitman Peg sorting               Sorts colored pegs into boxes                  .392*        .110         .383*        .065        -.051
18  O'Connor Wiggly Block             Puts together cutout block                     .230        -.050        -.073         .060         .332
19  Crockett assembly                 Puts nuts on bolts                             .248        -.018        -.174         .430*        .058
20  Crockett Block packing            Puts blocks in box                             .128        -.002         .113         .384*        .146
21  Crockett Block laying             Puts blocks in row                            -.078        -.150         .200         .180         .212
22  MacQuarrie Tracing                Traces narrow path                             .167         .094         .336         .134         .065
23  MacQuarrie Tapping                Makes dots rapidly                            -.015         .367*        .121         .322        -.001
24  MacQuarrie Dotting                Makes dots precisely (Fig. 47)                -.111         .035         .117         .501*        .043
25  MacQuarrie Copying                Copies figure by coordinates (Fig. 48)        -.015         .109         .007         .035         .494*
26  MacQuarrie Location               Locates items by coordinates                   .049         .305         .129        -.059         .352*
27  MacQuarrie Block counting         Counts hidden blocks (Fig. 48)                 .045         .077        -.008        -.062         .540*
28  MacQuarrie Pursuit                Traces line through tangle pattern (Fig. 48)  -.080         .058        -.077         .077         .515*
29  Stenquist I                       Identifies tools                               .397*        .179         .009        -.003         .222
30  Thurstone Spatial Relations A     Identifies rotated figures (Fig. 10)           .238         .004         .001         .010         .327
31  Thurstone Spatial Relations B     Identifies rotated figures                    -.066         .117        -.065        -.026         .487*
32  Thurstone Punched holes           Visualizes paper folding                      -.134         .086        -.054         .034        [illegible]
33  Thurstone Opposites               Verbal opposites (Fig. 43)                    -.044         .547*       -.014         .047        [illegible]
34  Thurstone Analogies               Verbal analogies (Fig. 43)                    -.048         .480*        .033         .014        [illegible]
35  Thurstone Completion              Vocabulary (Fig. 43)                           .046         .561*       -.071        -.073        [illegible]
36  Poor general rating                                                             -.155        -.257         .321        -.028         .180
37  Poor mechanical rating                                                          -.082        -.196         .491*       -.043        -.020

Loadings of .35 or over, printed in boldface type, are marked here with an
asterisk. Entries shown as [illegible] cannot be read in this copy.

then obtained the 666 intercorrelations for 37 variables. This overwhelming
array of data is in striking contrast to the results of the factor analysis, which
reduces this mass of data, two pages hence, to only two brief sentences.
The virtue of factor analysis is its ability to extract the significant elements
from such a table of intercorrelations.
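
The figure of 666 follows from counting each pair of variables once, as in
the tables of this chapter. A minimal sketch (the function name is
illustrative):

```python
def n_intercorrelations(n_variables):
    """Number of distinct pairs among n variables, each pair counted once."""
    return n_variables * (n_variables - 1) // 2


print(n_intercorrelations(37))  # 666
print(n_intercorrelations(6))   # 15, the number of entries in Table 32
```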

The factor loadings for the tests are reproduced in Table 35. The factor
analysis was made by a multiple-factor method, since the absence of a
general factor is suggested by the original correlations. The analysis shows five
common factors. The factors themselves are somewhat correlated. Factors
II, III, and V are influenced by a common element, probably general mental
alertness. Factors I, IV, and V have some factor in common, probably motor
speed. The problem of factors which are only partly independent may be
clarified by an analogy. Each factor represents a cluster of tests like a human
family, all closely related. But different families in a tribe also may have
some relationship, less close than within the family. Then, of course, there
may be tribes with no common blood between them.

Fig. 46. Tests measuring Factor I, Perception. (Courtesy, Marietta Apparatus Co.)

Factor I was found to be present particularly in test 9, routine assembly;
10, routine stripping; 6, assembly; 29, Stenquist I; 4, spatial relations
accuracy; 17, peg sorting; and 14, peg sticking. Harrell identifies the probable
meaning of this factor by the following reasoning (8, p. 22):

“None of the tests with high factor loadings are verbal. This factor seems to be
Thurstone’s perception (P) factor, even though none of the Thurstone perceptual
tests are present in the above list. Thurstone writes that: ‘The tests that call
for this ability require the quick perception of detail in either visual or verbal
material. . . .’ One reason for deciding that this is not essentially a manual factor
is because Factor IV is definitely manual. The Factor I identification tests differ
from those of Factor IV in that they demand more discrimination and are less
routine.”
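
The list of Factor I tests given above can be reproduced by ranking the
Factor I loadings of Table 35. The sketch below copies the loadings of a few
variables from that table and applies the .35 level that the table singles
out; treating that level as a selection rule, and the function name
defining_tests, are assumptions made for illustration.

```python
# Factor I loadings for some of the variables in Table 35 (number: loading)
factor_1 = {4: 0.392, 6: 0.415, 7: 0.228, 9: 0.624, 10: 0.457,
            11: 0.295, 14: 0.384, 17: 0.392, 18: 0.230, 29: 0.397}


def defining_tests(loadings, cutoff=0.35):
    """Tests whose loading reaches the cutoff, highest loading first."""
    chosen = [(test, w) for test, w in loadings.items() if w >= cutoff]
    # The sort is stable, so tied loadings keep their original order
    return sorted(chosen, key=lambda pair: pair[1], reverse=True)


for test, weight in defining_tests(factor_1):
    print(test, weight)
```

The printed order (9, 10, 6, 29, 4, 17, 14) is the order in which the tests
are named in the text.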

Factor II had its highest loadings in test 35, completion; 33, word op- 
posites; 34, word analogies; and 2, schooling. This was identified as a verbal 
factor. 

Factor III appeared in three non-test variables (1, youth; 3, inexperience;
and 37, poor mechanical rating) and two tests (14, peg sticking; and 17, peg
sorting). The reasoning used by Harrell to describe this factor calls attention
to significant elements in test validity (8, p. 23):

“The two variables with the highest weights, Youth and Inexperience, identify
this factor as Youth (Y). The high loadings in the two tests may be due to a
youthful willingness to follow directions, or to a deterioration with age. The
former explanation is preferred. Why just these two tests and no other should be
highly weighted with Y may be understood by remembering that sticking and
sorting the pegs according to their colors impress the machine fixing
subjects as more childish than other tests. It looked fairly sensible, for example, to
the machine fixers to put together a spark plug or a pair of calipers and there
the older fixers would cooperate, but they could not understand why they should
hasten to put red pegs in one tray and purple pegs in another tray. The younger
fixers, however, were more impressed with the entire testing program and could
not be dissuaded from believing that their responsiveness would influence their
future chances of promotions.”



Fig. 47. MacQuarrie Dotting subtest, a measure of Factor IV, Agility. (By
permission of California Test Bureau.) The subject dots as many circles as he
can in the time allowed. (Shown here one-half size.)

Factor IV included test 13, pinboard with both hands; 24, dotting; 15, 
taking nuts off; 20, block packing; 12, pinboard with non-preferred hand; 
and 16, putting nuts on. This factor involves manual dexterity or agility. 

Factor V is found in test 31 and 32, spatial tests; 27, block counting; 7, 
paper formboard; 28, pursuit; 25, copying; and 6, assembly. Nearly all these 
are pencil-and-paper tests requiring perceptual reasoning. This is recognized 
by Harrell as Thurstone’s spatial factor (S), which Thurstone relates to 
"tests which require the subject to think visually of geometrical forms and 
of objects in space.” 

To summarize: The 37 scores yielded 5 group factors: visual perception,
verbal, youth, manual agility, and spatial. There are two more pervasive
group factors overlapping and linking these, namely, general mental
alertness and motor speed.

The usefulness of such factor analysis is in giving a clearer picture of
what tests are measuring. While this analysis does not bring out the
importance of unique factors, it does suggest which tests tend to duplicate
measurement of the same factors. Thus Harrell points out that each factor
present in the more cumbersome apparatus tests is also measured by some
pencil-and-paper test.

[Figure 48. Three panels: MacQuarrie Block Counting, MacQuarrie Copying,
MacQuarrie Pursuit.]

Fig. 48. Tests measuring Factor V, Spatial. (By permission of California Test
Bureau.)

11. Suppose you wished to assemble a battery which included one test to
measure each of Harrell’s five factors as “purely” as possible. Which of the 37
tests would you use?

12. Describe each of the following tests in terms of Harrell's factors: Minnesota 
Paper Form Board (7), interests (8), dotting (24), and location (26). 

13. Putting nuts on bolts (16) and removing them from bolts (15) seem to be
fairly similar tasks. Can you explain why they have different factor
composition?

14. After having the workers assemble novel tools (6), Harrell had them repeat 
the procedure which was now familiar (9). Account for the shifts in the 
factor loadings. 

15. Describe in words the difference between Factor I (Perceptual) and Factor 
V (Spatial). 

16. The MacQuarrie subtests measure at least two different factors. What does this 
suggest regarding the use of this test for vocational guidance? 

17. Which tests have low weights in all five of Harrell’s factors? What does this 
indicate about their possible usefulness for measuring mechanical aptitudes? 

PRIMARY MENTAL ABILITIES 
Thurstone's Investigations 

Having seen how Harrell divided the heterogeneous domain of mechanical
ability into the most commonly measured factors, we are now in a position
to examine comparable work with mental tests. Factor studies of
intelligence were begun by Spearman and his associates in England. An
American pioneer in this area was T. L. Kelley, whose Crossroads in the mind
of man (11) outlined the practical possibilities of factor analysis which are
now beginning to be exploited. Spearman and Holzinger devoted many years
to an extensive search for unitary mental traits. These and
other ground-breaking studies have received less attention than the work of
Thurstone; the claims and possibilities of factor analysis are best illustrated
in his work, however.

Thurstone was interested in grouping the many types of exercises used in
intelligence tests, to determine what major sub-abilities were to be found.
His procedure was to give a large number of mental tests to the same
subjects, compute intercorrelations, and make a multiple-factor analysis (18).
The first study, which yielded nine factors, has since been supplemented by
many similar analyses. Several of the factors have been found repeatedly,
leading Thurstone to conclude that there are seven major types of ability
which account for most correlations among mental-test items. These factors
are: Spatial (S), Perceptual (P), Number (N), Verbal (V), Word Fluency
(W), Memory (M), and Reasoning (R). In addition, various minor factors,
found in only a few test exercises and difficult to isolate and identify, have
been suggested. One important factor found in studies other than Thurstone’s
is a speed factor; the speed element was held constant in Thurstone’s studies,
but would otherwise have accounted for part of the intertest relationship.
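
The first step of the procedure just described, computing the intercorrelations among tests given to the same subjects, can be sketched in a few lines. This is only an illustration: the test names and scores below are invented for the example, and the function names are hypothetical rather than taken from any study cited here.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def intercorrelations(scores):
    """Correlate every test with every other test.

    `scores` maps a test name to the scores of the same subjects,
    listed in the same order for every test.
    """
    names = list(scores)
    return {a: {b: pearson(scores[a], scores[b]) for b in names} for a in names}


# Hypothetical scores of six subjects on three made-up tests (not real data).
toy_scores = {
    "vocabulary": [12, 15, 9, 20, 14, 17],
    "sentence completion": [10, 14, 8, 19, 12, 18],
    "dotting": [30, 22, 27, 25, 31, 24],
}

matrix = intercorrelations(toy_scores)
for name, row in matrix.items():
    print(name, [round(r, 2) for r in row.values()])
```

In a table of this kind the two verbal tests correlate highly with each other and much less with the motor test. Clusters of that sort are what a factor analysis then tries to summarize.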

Thurstone followed his preliminary investigations with attempts to
construct tests which would measure the separate primary abilities. Several
batteries for different age levels have been produced, in which each test is
saturated with only one of the major factors. These tests yield a diagnosis
or breakdown of mental development, usually into six scores. (P has been
omitted.) Since time must be allowed for a reliable measurement of each
factor, the battery is long. The basic set of tests for ages 11 to 17 requires
nearly two hours. This set of tests is compared with other differential
batteries in Table 45. More recently, brief editions claiming to measure factors
in only five minutes each have been placed on the market. No acceptable
evidence of reliability is available and some of the tests are probably
undependable.

Although verbal tests form a cluster in a set of heterogeneous tests, when
a large battery of verbal tests is factored it breaks down into many
sub-clusters. Each domain has been found to consist of many abilities, so that
it is necessary to speak of “the verbal abilities,” “the number abilities,” and
so on. Occasionally such refined analysis carries implications for vocational
guidance and selection. Bechtoldt has analyzed the perceptual domain,
which is relevant to mechanical tasks and visual inspection jobs. He found
the following somewhat distinct abilities in such a battery: ability in
choice-discrimination tasks, facility with familiar symbols, facility in
associational recognition, facility in restructuring perceptual material, and
facility in organizing simultaneously presented stimuli during distraction,
etc. (1). A psychologist trying to identify superior inspectors could not use
just any perceptual task to measure the perceptual element in job success. The
element important in a particular inspection process might be any of
Bechtoldt’s factors, so that tests measuring each factor would probably have to
be tried.


The Primary Abilities 

While factor names are as descriptive as possible, the only way to
understand the meaning of a factor is to study the types of test exercise where
it appears. The following section describes the major Thurstone factors, with
reference to Thurstone’s tests and Wright’s factor analysis of the 1916
Stanford-Binet (24).

The verbal factor V is found in the usual vocabulary tests, and in tests 
of comprehension and reasoning. Wright found verbal loadings in Binet 
vocabulary, comprehension, verbal absurdities, dissected sentences, and 
other tests. Among the items Thurstone uses to measure V are: 


Today much of our clothing is designed to make a fashionable appearance
rather than for
    style    protection    children    sale    dresses

Synonyms: quiet
    blue    still    tense    watery


The word-fluency factor W, which is only moderately correlated with V,
calls for ability to think of words rapidly, as in anagrams and rhyming. The
distinction between V and W is shown in two synonym tests tried by
Thurstone (19, p. 3). A test requiring the subject to select the correct synonym
from several choices was saturated with V but not W; a test in which he
rapidly supplied three synonyms for each of a list of easy words measured
W, not V. In the primary abilities tests, one W exercise requires the
subject to list four-letter words beginning with c as fast as he can.

The spatial factor S was mentioned in connection with Harrell’s study.
S tests deal with visual form relationships. In the Binet, spatial loadings
appeared in the ball-and-field test, copying a diamond, drawing a design from
memory, and induction from observed paper-cutting. One of Thurstone’s
S tests is illustrated in Figure 10.

The number factor N appears in simple arithmetic tests. Tests of
arithmetical skill are more saturated with N than tests of arithmetical reasoning.
Counting from 20 to 1 and repeating digits backward are among the Binet 
items with number loadings. 

The memorizing factor M appears in tests which call for rapid rote
learning. One way to test this is to present nonsense material (such as
first-last name pairs) for a short study period, then to test recall after an
interval.

Reasoning, R, occurs in tests requiring induction of a rule from several
instances, as in number series and in drawing conclusions from syllogisms.
Many Binet items contain reasoning factors. Considerable loadings in
reasoning factors were found for the ball-and-field test, verbal absurdities,
interpretation of pictures, and other tests. The reasoning cluster is complex
and has not been thoroughly studied. The primary abilities tests make use
of such items as the following:

Letter series (Mark the letter which comes next) : 

a b x c d x e f x g h x . . .        h   i   j   k   x   y

Letter groupings (Mark the group that is different) : 

AAAB AAAM AAAR AATV 


Another of the tests is “Pedigrees,” illustrated in Figure 49.“ 


                 Jim ----+---- Alma
                         |
          +--------------+--------------+
          |              |              |
Susan - Henry          Irene           Ella - Arthur

This chart tells you that Jim and Alma were married and had three children,
Henry, Irene, and Ella. Henry married a girl named Susan, and Ella married a
man named Arthur.

Now answer these questions by consulting the chart.

Irene’s brother is             Jim    Henry    Arthur    Ella    Susan

How many children has Alma?    1    2    3    4    5

Irene’s brother-in-law is      Henry    Susan    Ella    Arthur    Jim



Fig. 49. Pedigrees, a test of factor R. 

18. Which of the primary abilities seem to be represented in each of the follow- 
ing tests: ACE Psychological (pp. 185 ff.), SRA Non-Verbal (p. 52), and 
Henmon-Nelson (p. 176)? 

19. Four scores in the primary abilities tests for ages five and six (V, P, Q
[similar to N], and S) are added to obtain a total “intelligence quotient,”
representing “general learning ability.” In what ways does the meaning of this
score differ from that of other intelligence ratings for school ages? Which
type of information about the pupil is most “valid”?

20. This statement comes from the manual for one of the primary abilities series:
“Word Fluency. This factor involves the ability to think of isolated words
with little concern for their meaning. It is important in school courses where
written assignments are stressed, and in occupations requiring creative
writing or reporting.”

a. What evidence is required to demonstrate that the second sentence is true?

b. Describe a school assignment where W, rather than V, might be important.

“ From the Chicago Test of Primary Mental Abilities, Science Research Associates,
Chicago.

Are Primary Abilities “Real”? 

Throughout the history of factor analysis, there has been considerable
debate regarding the reality of mental factors. Much of this debate is due
to misunderstanding of the differences between systems of analysis, or of
the statistical methods involved. Wolfle, who summarizes the development
of factor analysis, makes this comment (23, p. 26):

“Thomson, Thurstone, and Tryon have repeatedly criticized the naivete of
supposing that every factor necessarily represents an ultimate and unitary mental
ability. None of the major students of factor analysis ever held such a view,
but some of their critics have fallen into the easy error of accusing both Spearman
and Thurstone of it because of the names they have given to their factors.
Spearman’s concept of g as the total fund of mental energy, and Thurstone’s ‘primary
traits’ and ‘primary abilities’ are easily misinterpreted. The ordinary connotations
of the word ‘primary’ are such as to foster the belief that Thurstone has, or
believes he has, isolated the basic and ultimate causes of differences in ability.”

Factor analysis is in no sense comparable to the chemist’s search for
elements. There is only one answer to the question: What elements make up
table salt? But in factor analysis there are many answers, all equally true
but not equally convenient. The factor analyst may be compared with the
photographer trying to picture a building as revealingly as possible.
Wherever he sets his camera, he will lose some information, but by a skillful
choice he will be able to show a large number of important features of the
building.

Primary traits are not indivisible. They are merely families in which many
interrelated abilities can be found. Their diagnostic importance lies in the
fact that the separate families are less highly correlated than are tests within
any one family. On this point hinges one of the major misunderstandings
regarding factor analysis. Multiple-factor analysis has not disproved the
existence of a general ability. On the contrary, Thurstone’s factors are
themselves correlated (Table 36). More and more studies now show that a
general factor may be extracted from such tests as those yielding the
primary-abilities theory. Thurstone speaks of the primary factors as “different
media for the expression of intellect” (20, p. 3). Whether the general factor is
as prominent among adults as among children is still unsettled (2, 5, 22).


Table 36. Intercorrelations of Six Primary
Abilities for Eighth-Graders (21, p. 37)

       N      W      V      S      M      R
N           .47    .38    .26    .19    .54
W                  .51    .17    .39    .48
V                         .17    .39    .55
S                                .15    .39
M                                       .39
R
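
One simple way to see how a general factor can be drawn from correlated primary abilities is to compute first-centroid loadings from Table 36. The sketch below is an illustration only, not a reproduction of any published analysis. It rests on two assumptions not stated in the text: each diagonal entry (communality) is estimated by the largest correlation in its column, a common hand-computation convention, and only the first centroid factor is extracted. The function names are hypothetical.

```python
from math import sqrt

ABILITIES = ["N", "W", "V", "S", "M", "R"]

# Upper triangle of Table 36 (eighth-graders).
UPPER = {
    ("N", "W"): .47, ("N", "V"): .38, ("N", "S"): .26, ("N", "M"): .19, ("N", "R"): .54,
    ("W", "V"): .51, ("W", "S"): .17, ("W", "M"): .39, ("W", "R"): .48,
    ("V", "S"): .17, ("V", "M"): .39, ("V", "R"): .55,
    ("S", "M"): .15, ("S", "R"): .39,
    ("M", "R"): .39,
}


def correlation(a, b):
    """Look up a correlation regardless of the order of the two abilities."""
    return UPPER.get((a, b), UPPER.get((b, a)))


def first_centroid_loadings(names):
    """First-centroid loadings: column sum divided by the root of the grand sum."""
    column_sums = {}
    for a in names:
        others = [correlation(a, b) for b in names if b != a]
        # Assumption: the diagonal is estimated by the highest correlation in the column.
        column_sums[a] = sum(others) + max(others)
    grand_total = sum(column_sums.values())
    return {a: column_sums[a] / sqrt(grand_total) for a in names}


loadings = first_centroid_loadings(ABILITIES)
for ability in ABILITIES:
    print(ability, round(loadings[ability], 2))
```

Under these assumptions every ability has a positive loading on the first factor, with R loading highest and S lowest, which is consistent with the statement that the primary abilities share something in common.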

The factor analysis of a test will be somewhat different when based on
different groups of subjects. Harrell’s peg-sorting test might measure
peg-sorting skill in a young group, but when older subjects were included most
differences were accounted for by motivation. One must be particularly
hesitant about assuming that a test measures the same factors in widely
different groups, such as normals and psychotics (21, pp. 2 ff.).

One major question relates to the validity of the primary abilities for
guidance and other practical application. There is little evidence today which
permits practical interpretation of a Primary Mental Abilities (PMA) test
profile. Those who claim that factor V predicts linguistic subjects, N predicts
mathematics, and S predicts art or architecture have leaped to conclusions.
Thurstone does declare that the tests have been used for over half a million
cases in Chicago high schools and have been useful in understanding pupils.
Says he, “It requires often considerable insight of the examiner to relate the
mental profile to the circumstances of each case, but there is no question
but that the profile is more helpful than the IQ in the interpretation of
educational and behavior problems” (20, p. 10). This is an encouraging
statement, but at present there is no published research to assist a counselor
in making inferences about, say, a pupil with poor M and high R, or low V
and average W. Adoption of the tests in school or industry is probably
premature.

Some critics of factor studies have been disappointed because not all
factors seemed to measure practically significant mental abilities. Kelley
has expressed a fear that some of the factors are “mental factors of no
importance” (12). Probably the correct position to take is that factor studies
clarify what present tests measure. They cannot identify factors not built into
the original tests. They cannot guarantee to produce factors of practical
importance. But by clarifying the nature of tests they permit the psychologist
to decide whether he is satisfied with the present content of his instruments.
Too, the sorting out of primary abilities directs research to the question:
For what are these particular human talents useful, and how can we
capitalize on them?

One of the many studies of the relation between factored abilities and
school performance is that of Swineford. Her work was based on a bi-factor
analysis, which led to separate estimates of the general, verbal, and space
factors. The correlations of these factors with high-school marks were as
shown in Table 37. The column headed “Multiple Correlation” shows the
degree of prediction possible by the best combination of the three scores.
It may be concluded that factors V and S add little to the prediction based
on general mental ability alone.


Table 37. Prediction of Marks from Factor Scores (17)

                                         Factor
Subject               N      General*   Verbal   Spatial   Multiple
                                                           Correlation

General science       110    .74        .19      .21       .74
Biology               36     .65        .11      .07       .66
English               110    .53        .39     -.08       .64
English               40     .65        .11      .13       .66
Civics                35     .56        .30      .05       .69
Art                   108    .68        .16      .24       .69
Algebra               50     .54       -.03      .12       .57
Geometry              36     .54        .11      .14       .55
Mechanical drawing    14     .85       -.47      .78       .87

* The test for g included items resembling Thurstone tests N and R, in addition
to general-factor loadings from the verbal and spatial tests.


A similar study by Shanner, with a set of Thurstone’s tests, showed
correlations of only moderate size between ability scores and achievement
tests, and some of them in directions contrary to expectation. (Example: N with
advanced French, .34; with algebra, .12; with general science, -.32 [15].) A
large number of studies at the college level give similar results (6). One of
the more informative correlates seven PMA scores (R is broken in two parts and
W is not tested) with grades for home economics students, and also correlates
the grades with the Pressey and Otis tests, both of which are brief measures of
general ability. The following listing includes all correlations over .30 (and
the highest correlations for the art course):

Art: Inductive, .26; S, .25; V, .24.

Chemistry: N, .46; deductive, .43; inductive, .37; Pressey, .38; Otis, .37.

English composition: V, .55; Otis, .54; Pressey, .48.

Home Econ. 101 (Personal Problems): V, .50; inductive, .35; P, .31; Pressey,
.41; Otis, .48.

Home Econ. 109 (Nutrition): V, .37; N, .33; deductive, .33; Pressey, .41; Otis,
.43.

The best combination of two or three primary factor scores should predict
grades a little better, in most subjects, than the short general mental test.
The improvement of correlation is quite small, however, especially in view
of the longer testing time used. All studies of prediction indicate that in
its present stage of development the factorial approach has not produced
tests which are superior to non-factorial diagnostic tests for practical
purposes.
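
How small such improvements can be is easy to check against Swineford's figures in Table 37. The sketch below subtracts the general-factor correlation from the multiple correlation for each row. The row values are copied from the table; the function name and the summary statistic are choices made for this illustration.

```python
from statistics import median

# (subject, general-factor correlation, multiple correlation) from Table 37.
# English appears twice because the table reports two separate groups.
TABLE_37 = [
    ("General science", .74, .74),
    ("Biology", .65, .66),
    ("English", .53, .64),
    ("English", .65, .66),
    ("Civics", .56, .69),
    ("Art", .68, .69),
    ("Algebra", .54, .57),
    ("Geometry", .54, .55),
    ("Mechanical drawing", .85, .87),
]


def gain_over_general(rows):
    """Increase in correlation when V and S are added to the general factor."""
    return [(subject, round(multiple - general, 2)) for subject, general, multiple in rows]


gains = gain_over_general(TABLE_37)
for subject, gain in gains:
    print(f"{subject:20s} {gain:.2f}")
print("median gain:", median(g for _, g in gains))
```

For most rows the gain is only a few hundredths of a correlation point, which is the sense in which the added factor scores contribute little beyond the general measure.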

It was once hoped that only a few scores would be needed to describe 
fully the significant abilities of men. This hope has been dashed by increased 
experience with factor studies. According to Davis (4, p. 59): 

“The results of testing hundreds of thousands of men in the armed forces and
of analyzing these data suggest to many psychologists that the number of basic
mental abilities may often have been underestimated. From factorial analyses of
many different matrices of intercorrelations obtained as a result of testing aviation
cadets in AAF classification centers, factors that have been mathematically
determined have been named as indicated in the following list.


Carefulness
General reasoning I
Integration I
Integration II
Integration III
Judgment
Kinesthetic motor
Length estimation
Mathematical background
Mathematical reasoning
Mechanical experience
Memory I
Memory II
Memory III
Numerical
Perceptual speed
Pilot interest
Planning
Psychomotor coordination
Psychomotor precision
Psychomotor speed
Reasoning II
Reasoning III
Social science background
Spatial Relations I
Spatial Relations II
Spatial Relations III
Verbal
Visualization


“There is no objective method of determining whether the names attached to the
factors discovered in the analyses are accurate descriptions of the mental abilities
represented by the factors. In any case, . . . the number of basic mental abilities
may be much larger than was formerly believed.”

21. Which correlations in Swineford’s study were most changed by adding V and 
S as predictors? 

22. Are there any areas of mental ability, which might yield additional factors, 
which would not have been included in studying aviation classification but 
might be important in other studies of talent? 

23. How closely do the Thurstone factors correspond to the definitions of intel- 
ligence given in Chapter 6 (p. 107)? 

24. According to the manual, the PMA tests for ages five and six should be given
at the start of the first grade. “The tests may be given at the end of the
kindergarten year, but mental growth is so rapid at this age that more valid
measures can usually be obtained at the time of entrance to first grade.”

a. Why would mental growth affect validity, as well as level of scores?

b. In view of the quoted statement, are scores obtained early in the first
grade likely to be trustworthy?

CHAPTER 10


Tests of Special Abilities 


This chapter describes tests of special abilities widely used in vocational 
guidance and industrial selection. The purpose in describing these tests is 
to acquaint the reader with the variety of tools that have been explored and 
the ingenuity used in designing tests. No particular test or group of tests 
is of outstanding importance. The tests are grouped according to the
following categories: spatial and perceptual abilities, psychomotor abilities,
mechanical knowledge, artistic abilities, and sensory acuity and
discrimination.


SPATIAL AND PERCEPTUAL ABILITIES 
Spatial and perceptual tests have already been encountered in Thurstone’s 
and Harrell’s factor analyses, and as parts of many tests of general mental 
ability. Spatial tasks are illustrated in Figures 10 and 48, and in Figure 50 
in this section. Tests of spatial abilities require the subject to reason about 
forms, or to recognize relations between them. Some tests are two-dimen- 
sional, and others require the solution of three-dimensional problems. Both 
pencil-and-paper and apparatus tests are used. 

Spatial abilities have been considered important in engineering subjects 
such as mechanical drawing, descriptive geometry, architecture, and the 
like. Such a factor might be important in shop training and industrial jobs 
which call for judgment rather than mere manipulation. Validation studies, 
some of which are listed in Table 38, show that spatial ability is a useful 
predictive factor, but it alone rarely accounts for the major differences among 
students or workers. Such tests must be combined with other predictors. 
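
One rough way to read the statement that spatial ability alone rarely accounts for the major differences is to square each validity coefficient, which gives the proportion of criterion variance linearly associated with the test. The sketch below applies this to a few correlations from Table 38. Treating r squared as "variance accounted for" is a standard interpretation supplied here, not one stated in the text, and the labels are shortened for the example.

```python
def variance_accounted_for(r):
    """Proportion of criterion variance linearly associated with the test (r squared)."""
    return r * r


# Selected correlations from Table 38, with abbreviated labels.
spatial_validities = {
    "Wiggly Block, descriptive geometry grades": .30,
    "Mann pencil-paper test, descriptive geometry grades": .63,
    "Paper Form Board, apprentice pressmen": .58,
    "Paper Form Board, merchandise packers": .48,
}

for label, r in spatial_validities.items():
    print(f"{label}: r = {r:.2f}, r squared = {variance_accounted_for(r):.2f}")
```

Even the largest of these coefficients leaves well over half of the criterion variance unexplained by the spatial test, which fits the advice to combine such tests with other predictors.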

Table 38. Empirical Validation of Spatial Tests

Test and Publisher          Sample           Criterion                  Correlation  Remarks and Reference
                                                                        of Test and
                                                                        Criterion

Wiggly Block (Stoelting)    Engineering      Grades in descriptive      .30          Test described in text (22).
                            students         geometry

Mann Staticube (Missouri    Engineering      Grades in descriptive      .63          Pencil-paper test with high
Educ. Test Co.)             students         geometry                                difficulty. May overlap with
                                                                                     intelligence tests (3).

MacQuarrie Mechanical       Dentistry        Grades in “technique”      .25 to .46*  See Fig. 57, 58 (23).
Ability (Calif. Test        students         courses
Bureau)

MacQuarrie: Location        Adding-machine   Production on work sample  .31          Other subtests less effective,
subtest                     operators                                                except Tracing (r = .38)
                                                                                     (28, p. 149).

Minnesota Paper Form Board  Apprentice       Ratings of skill, compared .58          Job requires fine perceptual
(Psych. Corp.)              pressmen         to others of equal                      judgment and mechanical
                                             experience                              adjustments (9).

Minnesota Paper Form Board  Lampshade sewers Production speed           -.04         (28, p. 221)

Minnesota Paper Form Board  Merchandise      Production speed           .48          (28, p. 221)
                            packers

Minnesota Paper Form Board  Pull-socket      Production speed           .05          (28, p. 221)
                            assemblers

* Where more than one correlation is given in tables of this type, several
estimates have been made.


Fig. 50. Minnesota Paper Form Board items.


A widely known apparatus test of spatial ability is O’Connor’s Wiggly
Block. This test consists of a large block (9 X 9 X 12 inches) which has been
separated into nine longitudinal segments by wavy cuts with a jig saw. The
parts are scattered in front of the subject, and the time he requires to
assemble them is noted. The test was reported by O’Connor as a measure of
“spatial visualization.” Differences in the way the subject attacks the test
appear to be of major importance in determining the score. Some people use
a sheer trial-and-error approach, trying to assemble the block with great
haste. Others study the pieces deliberately and then assemble the block.
Some begin well, but are so concerned when unable to find the right block
for a particular spot that they become disorganized. All these characteristics
suggest that perhaps the test reveals behaviors which would be significant
on the job, but these are significant in the realm of personality, not ability.
Since ability cannot be measured without some control of personality factors,
scores on the Wiggly Block do not measure aptitude alone. It does provide
an interesting opportunity to observe personality. Remmers and Smith (22)
have been critical of the reliability of the Wiggly Block, which ranges (retest
method) in the .70’s for the three trials recommended by O’Connor. A total
of ten trials would be reliable enough for guidance. Reliability is a common
problem in psychomotor testing. Puzzles, where luck plays a part, are
especially likely to be unreliable tests.
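
The effect of lengthening a test from three trials to ten can be estimated with the Spearman-Brown formula. The text does not say how the ten-trial recommendation was reached, so the sketch below is only one plausible way to check it. It assumes that added trials behave like parallel measures, an assumption that practice effects on a puzzle could easily violate.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by `length_factor`.

    Assumes the added parts are parallel to the original ones.
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Three-trial reliabilities "in the .70's", projected to ten trials.
for three_trial_r in (.70, .75, .79):
    ten_trial_r = spearman_brown(three_trial_r, 10 / 3)
    print(f"3 trials: {three_trial_r:.2f}  ->  10 trials: {ten_trial_r:.2f}")
```

Under this assumption a three-trial reliability of .75 rises to about .91 for ten trials, which is in the range usually wanted for decisions about individuals.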

Perceptual tests emphasize rapid observation rather than reasoning. Per- 
ceptual tests have not been prominent in practical selection, although in a 
few fields perceptual abilities may be important. These are thought to in- 
clude factory inspection jobs, aircraft recognition, bombardiering, and radar 
operation. Perceptual factors have also been studied intensively in research 
on reading, and perceptual tasks are used to diagnose poor readers or to 
determine whether pupils entering school have adequate ability to learn to 
read. 

A special type of perceptual test is used for measuring “clerical aptitude.” 
In the Minnesota Clerical Test, which is typical, the subject goes through a 
list like the following, noting errors as rapidly as possible:

79542                    79524
5794367                  5794367
John C. Linder           John C. Lender
Investors Syndicate      Investors Syndicate

A factor study shows this ability to be distinct from other perceptual abilities 
involving judgment of forms (34). Validation studies of perceptual tests are 
illustrated in Table 39. 
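
The checking task shown above is simple enough to state as a rule: a pair contains an error whenever its two entries are not identical. The sketch below builds a key for the four sample pairs and scores a set of responses against it. The scoring rule used here, a plain count of correct judgments, is an assumption for illustration and is not taken from the test manual.

```python
# The four sample pairs quoted in the text.
PAIRS = [
    ("79542", "79524"),
    ("5794367", "5794367"),
    ("John C. Linder", "John C. Lender"),
    ("Investors Syndicate", "Investors Syndicate"),
]


def error_key(pairs):
    """True where the two entries of a pair differ, that is, where an error exists."""
    return [left != right for left, right in pairs]


def count_correct(responses, key):
    """Number of pairs judged correctly (illustrative scoring rule, an assumption)."""
    return sum(1 for given, expected in zip(responses, key) if given == expected)


key = error_key(PAIRS)
# A hypothetical subject who overlooks the misspelled name in the third pair.
responses = [True, False, False, False]
print(key)
print(count_correct(responses, key))
```

Because every item is easy when taken slowly, differences between subjects on such a task come mainly from speed and accuracy of observation, which is why the test is given under a time limit.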


Table 39. Empirical Validation of Perceptual Tests

Test and Publisher          Sample           Criterion                  Correlation  Remarks and Reference
                                                                        of Test and
                                                                        Criterion

Minnesota Clerical Test     Groups of        Supervisors' ratings       .28 to .42   Criterion believed unreliable
(Psych. Corp.)              clerical workers                                         (21, p. 208).

Minnesota Clerical          Punch-card       Volume of errorless        .24 to .33   Correlations based on parts
                            operators        production                              of test (28, p. 143).

Minnesota Clerical          Coding clerks    Volume of errorless        .38 to .46   Correlations based on parts
                                             production                              of test (28, p. 145).

Speed of identification     Bombardier       Graduation-elimination     .14          (8, p. 264)
(Air Force)                 students

Visual section in Reading   First-grade      Reading achievement at     .60          Tests of orientation, visual
Aptitude Tests, Primary     pupils           end of year                             pursuit, memory for forms.
form (Houghton Mifflin)                                                              Visual, auditory, and other
                                                                                     tests combined correlate .75
                                                                                     with criterion (17).

Lee-Clark Reading Readiness First-grade      Reading achievement at     .49 to .68   Intelligence test correlates
(Calif. Test Bureau)        pupils           end of year                             only .40 with criterion. Test
                                                                                     calls for matching letters
                                                                                     (15).

1. What characteristics of the jobs involved might account for the variation of
validity coefficients for the Minnesota Paper Form Board from .58 to -.04
(Table 38)?

2. How might the tenth trial on the Wiggly Block reflect a different characteristic
of the subject than the first three trials?

3. Under what circumstances might a test such as the Minnesota Clerical have
low correlations with production among workers already successful in an
occupation, and yet be highly valid for selecting new employees?

PSYCHOMOTOR ABILITIES 
Grouping of Psychomotor Abilities 

If one defines psychomotor to refer to all tests demanding mind-muscle 
cooperation, many sorts of test are included. We will concentrate attention 
here on tests useful for vocational selection. These tests deal with speed, 
simple reactions, and coordination, as well as perceptual factors. 

Many people have investigated whether there is a general motor ability 
running through all skilled behaviors. Studies show that various significant 
motor behaviors are largely independent. A number of factor analyses now 
suggest the following list of group factors in psychomotor tests (7, 10, 20,
33, 34):

Spatial
Perceptual
Manual dexterity (Harrell's “agility”)
Stereotyped rapid movement (“aiming”)
Steadiness
Strength

There are other important elements each found in only a few psychomotor
tests. In this chapter, it is convenient to group psychomotor tests as follows:
speed and simple reaction time; steadiness and simple controlled
movement; choice reaction time; and complex coordination. Tests grouped
together are not necessarily highly correlated. Psychomotor tests are largely
independent, so that when selecting workers one must find a particular
performance test similar to whatever task one is investigating. One cannot rely
on a few “primary” psychomotor abilities to include all important factors.

Speed and Simple Reaction Time 

Among the methods used to determine the speed of simple reactions are 
tapping tests. These may use an electrical circuit, the subject tapping a key 
as rapidly as possible, or may use a pencil-and-paper form such as the Mac- 
Quarrie tapping test in which the subject places three dots in each printed 
circle as rapidly as possible. Speed tests generally measure finger, hand, or
hand-and-forearm movements.

A slightly more complicated test of speed is the Minnesota Rate of
Manipulation Test. This uses a board, containing rows of two-inch holes, and a
number of loosely fitting round blocks. The subject places the blocks in the
holes as rapidly as possible. Since everyone can do the task, differences in
score are almost entirely attributable to speed. In the turning test, the same
equipment is used, but the subject turns each block over in its hole. The
correlation of placing and turning scores, similar as the tasks appear to be, is
only .57 (36).

Reaction time is tested by special timing equipment. The subject may be
required to throw a switch as soon after a light flashes as possible, or to step
on a “brake” in response to a signal. For the refined measurements of
reaction time required in experimental psychology, quite elaborate devices
have been developed.

Tests of speed predict performance in extremely simple activities. There is 
little relation between speed of simple and complex reactions, or between 
simple reaction time and reaction time when judgment is required. Validity 
data are reported in Table 40. 


Table 40. Empirical Validation of Motor Speed Tests

Test and Publisher          Sample           Criterion                  Correlation  Remarks and Reference
                                                                        of Test and
                                                                        Criterion

Minnesota Rate of           Packers of       Ratings of speed on job    .66          Based on three part scores
Manipulation (Educ. Test    facial tissues                                           combined (36).
Bureau)

Minnesota Manipulation      Can packers      Production speed           .35, .22     Based on separate part scores
                                                                                     (28, p. 227).

Minnesota Manipulation      Pull-socket      Production speed           .09, .19     Based on separate part scores
                            assemblers                                               (28, p. 227).

Disk-setting, similar to    Adult women      Ratings of skill in        .56 to .77   Median, .67 (32).
Minn. Manipulation                           packaging, etc.

Dotting, similar to         Adult women      Ratings of skill in        .31 to .83   Median, .64 (32).
MacQuarrie Tapping                           packaging, etc.



Steadiness and Simple Controlled Movement 

Steadiness is required where one must hold a posture or must trace a 
pattern accurately. Postural steadiness has been tested by the ataxiameter. 
One such instrument consists of a small balanced platform on which the 
subject stands; movements of the platform are recorded automatically. Arm 
steadiness is tested by measuring the subject’s ability to hold a stylus out- 
stretched in a small aperture without touching the sides of the hole. The 
stylus and hole are connected electrically, and each contact is registered on
a counter. A test of hand steadiness requires the subject to move a stylus 
along a groove with narrowing walls. 

Aiming also may be measured by a stylus-and-hole apparatus. The subject 
is required to thrust the stylus into progressively smaller holes without
touching the sides (Fig. 11), or into holes momentarily uncovered by a 
rotating shutter. 

4. Decide which type of steadiness test would be most promising for selection of 
operators for each of the tasks listed below. If none of these tests seems fully 
suitable, attempt to describe one more comparable to the job.

a. A jigsaw operator is to move a board, about eight inches square, so that a
curved pattern is cut out. 

b. A rifleman must hold his sights steadily on a target while resting on an elbow 
in prone position. 

c. A pistol marksman must hold his sights steadily on a target while standing. 

d. An engraver must follow a pattern with great precision using a small power 
tool. 

Various measures test coordination or dexterity. The simplest coordinations 
require ability to use opposing muscles or two parts of the body coopera- 
tively. Pursuit rotors have been a popular coordination test. In one common 
pursuit device, a small disk is set in a phonograph-like turntable. The sub- 
ject is given a stylus and must follow the disk as it traces a circular or eccen- 
tric path. Electrical connection between the stylus and disk makes it possible 
to record the total time the subject holds contact. 
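
The time-on-target score described above can be sketched in a few lines of Python. The apparatus itself accumulates contact time electrically; the sampled on/off record used here, and the function name, are assumptions made only for illustration.

```python
# Sketch: percent of a trial spent in contact with the target.
# Assumption: contact is logged as one True/False sample per equal time step.

def percent_on_target(contact_samples):
    """Return the percentage of samples in which the stylus touched the disk."""
    samples = list(contact_samples)
    if not samples:
        raise ValueError("at least one sample is required")
    touching = sum(1 for sample in samples if sample)
    return 100.0 * touching / len(samples)


# Hypothetical 60-sample trial with contact during 45 of the samples.
trial = [True] * 45 + [False] * 15
print(percent_on_target(trial))
```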

The Two-Hand Coordination (or “lathe”) Test is more complicated. The
subject must follow a slowly moving disk with a pointer mounted in a carriage.
By turning one handle with his right hand, he can move the pointer in
one coordinate; by turning another handle simultaneously with his left hand,
he can obtain movement in the perpendicular direction. Both hands must be
used constantly to pursue the disk on its irregular path. Total time of contact
is recorded.


Table 41. Empirical Validation of Tests of Simple Coordination

Test and Publisher              Sample                  Criterion                   Correlation   Remarks and Reference
------------------------------------------------------------------------------------------------------------------------------------------
Steadiness, handᵃ               Dentistry students      Four-year dentistry grades  .04           (11)
O'Connor Finger Dexterity       Dentistry students      Dentistry grades            .15           (11)
  (Stoelting)
Finger Dexterity                Dentistry students      Dentistry grades, all       .20, .30      r's range .00 to .38 for specific courses (21).
                                                          courses
Finger Dexterity                Pull-socket assemblers  Production speed            -.09          (28, p. 235)
Scott Three-Hole (Stoelting)    Girls in training       Ratings as power-sewing-    .54           (Cf. Table 43.) Any single test practically
                                                          machine operators                         valueless in eliminating failures (31).
MacQuarrie Test of Mech.        Punch-card operators    Volume of errorless         .19 to .27    Other subtests gave lower correlation;
  Ability; Dotting                                        production                                Tapping, .19; Copying, .15; Pursuit, .12,
  (Calif. Test Bureau)                                                                              etc. (28, p. 143).
MacQuarrie; Tracing             Adding-machine          Production in work sample   .38           (28, p. 148)
                                  operators
MacQuarrie; Dotting             Adding-machine          Production in work sample   .18           (28, p. 148)
                                  operators
Pegboard (USES)                 Can packers             Production speed            .45           (28, p. 148)

ᵃ Steadiness tests are published by Stoelting and other firms.

Dexterities are measured by devices emphasizing both speed and accuracy.
The O'Connor Tweezer Dexterity Test is one of the most widely known
examples (Fig. 2). The subject is given a trayful of pins to be set rapidly in
holes by means of tweezers. Finger dexterity is tested by tasks requiring the 
subject to screw nuts on bolts, or place pegs in holes. 

Validities of tests of steadiness and simple movements are illustrated in
Table 41. 


Choice Reaction Time 

Few skilled jobs call for simple instantaneous reaction where the subject 
can set himself and know in advance what reaction will be signaled for. 
Instead, the locomotive engineer, the gunner, and industrial control operators
must make instant choices among the responses open to them to select
the one most appropriate. Tests of complex reaction time present series of 
signals, with the subject to make the response associated with each signal. 
An example is the Discrimination Reaction Time test used by the Air Force. 
Two green and two red lights are mounted in a square pattern before the 
subject, together with four switches. The subject is to consider any pair of 
lights that flashes as if it were an arrow, the red light being the head. If this 
“arrow” points upward, for example, the uppermost switch is to be thrown as
quickly as possible. Problems are presented in rapid succession. 
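
The decision rule the subject must apply can be sketched as follows. The text gives only the rule (the red light is the head of the arrow, and the switch in that direction is thrown); the coordinate layout, the function name, and the switch labels below are assumptions introduced for illustration and do not describe the Air Force apparatus itself.

```python
# Sketch of the rule in the Discrimination Reaction Time test.
# Assumptions: lights sit at the corners of a unit square, coordinates are
# (x, y) with y increasing upward, and the four switches are labeled by
# direction. None of these details are specified in the text.

def correct_switch(green_xy, red_xy):
    """Return the switch matching the arrow drawn from green (tail) to red (head)."""
    dx = red_xy[0] - green_xy[0]
    dy = red_xy[1] - green_xy[1]
    if dx == 0 and dy > 0:
        return "up"
    if dx == 0 and dy < 0:
        return "down"
    if dy == 0 and dx > 0:
        return "right"
    if dy == 0 and dx < 0:
        return "left"
    raise ValueError("the two lights must differ along exactly one axis")


# Green below, red above: the arrow points upward.
print(correct_switch(green_xy=(0, 0), red_xy=(0, 1)))
```

Even in this reduced form the task requires a judgment before the movement, which is what separates it from simple reaction time.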

Complex Coordination 

With complicated machinery to be managed by simultaneous control of 
levers, pedals, and wheels, the operator must perform complex acts of skill.
A conspicuous example is the task of piloting an airplane. The pilot must 
manage controls for rudder, elevator, and tail surfaces simultaneously, per- 
haps moving them in different directions and at different speeds. To test 
aptitude for this task, the Air Force used a Complex Coordination Test in 
which the subject, responding to light signals, was required to make the 
three types of motion used in piloting (Fig. 51). 

This task represents the “job miniature” or job replica test. There is reason
to believe that a different complex coordination test must be devised for each 
job. To make the test for pilots as much like the job coordination as possible, 
the equipment used and the motions required resemble the job as much as 
possible. In this case, the test equipment consists of a “stick” which moves 
like that in the plane, and foot rudder controls. The stimulus is a panel of 
lights which partially duplicates the type of signal given by turn and bank 
indicators. The advantages of replica tests are that they tend to require the 
same aptitudes as the job itself, that they appeal to applicants as realistic, 
and that they avoid the danger and expense of a tryout on the actual equip- 
ment. The name “miniature,” although widely used, is misleading. Unless a 
replica is built to the same scale as the job equipment, and requires the same 
actions, it is not likely to measure the job aptitudes. 

Another type of work sample uses equipment just like that on the job.
Sometimes special performance tasks are designed which duplicate one specific
segment of the job to be predicted. An example is the Metal Filing Worksample,
intended to measure that skill used in dentistry. This isolates one element
of the job and measures it directly.

Fig. 51. SAM Complex Coordination Test (Mashburn), a job replica used in pilot
selection. Note arrangements for group administration of a psychomotor test.
Counters before the examiner record each man's performance. (Courtesy USAF.)

Complex coordination tests are of considerable value, provided they truly
resemble the job to be predicted. The Metal Filing Worksample correlated .53
with grades in dentistry courses (1). The I.E.R. Trimming Test,¹ in which one
cuts between a pair of narrowing lines with scissors, correlated .69 with
ratings of power-sewing-machine trainees (31). The Hand-Tool Dexterity Test,²
requiring operations on bolts and nuts with wrench and screwdriver,
correlated .46 with performance of machinists (4).


General Considerations 

The essential feature of a good psychomotor test is that it measure exactly
the abilities called for by the job. In other areas where there is a large general
factor there are usually many adequate tests, but in the psychomotor
area the test that serves best on one problem may be quite unsatisfactory in
predicting the next job. Table 42 presents Air Force data on this type of test.
Validities are the correlations between test score and success in completing
a particular type of training. The tests are not interchangeable. They measure
different things, according to the intercorrelations. The Complex Coordination
Test, good as a predictor of pilot success, is no good as a predictor
of bombardiering; and so on.

¹ Publisher: C. H. Stoelting Co.
² Publisher: Psych. Corp.


Table 42. Intercorrelations of Psychomotor Tests for Aircrew Candidates (8, pp. 92 ff.)ᵃ

                        Complex        Two-Hand       Rotary         Choice         Aiming         Finger         Validities
                        Coordination   Coordination   Pursuit        Reaction Time  Stress         Dexterity      Bomb.  Nav.   Pilot
-------------------------------------------------------------------------------------------------------------------------------------
Complex coordination                   .46            .30            .27            .14            .22            .13    .31    .40
Two-hand coordination   .46                           .37            .25            .17            .25            .12    .29    .35
Rotary pursuit          .30            .37                           .22            .23            .26            .14    .12    .26
Choice reaction time    .27            .25            .22                           .11            .19            .25    .41    .28
Aiming stress           .14            .17            .23            .11                           .19            .00    .09    .12
Finger dexterity        .22            .25            .26            .19            .19                           .15    .22    .15
Steadiness              .12            .07            --             .00            --             --             .16    --     .02
Rudder control          .32            .30            .32            .07            --             .11            --     --     .30

ᵃ Combined from several tables. All correlations are based on large numbers of cases. Dashes
indicate data not available. Validities differ slightly from those in Table 49; these differences are the
result of basing the calculations on different samples.


The reader will have noted that psychomotor validity coefficients are gen- 
erally disappointingly low, but psychomotor tests improve prediction when 
combined or added to batteries of other types of tests. Some of the validity 
coefficients are low because the criteria are inaccurate measures of success. 
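
Both statements in the preceding paragraph can be illustrated numerically. The sketch below applies the standard two-predictor multiple-correlation formula to entries read from Table 42 (Complex Coordination and Choice Reaction Time against pilot training), and the standard correction for unreliability in the criterion. The function names are illustrative, and the criterion reliability of .64 is a hypothetical figure, not one reported in the text.

```python
import math

# Sketch: (1) the gain from combining two tests, (2) the effect of an
# unreliable criterion. Function names are illustrative assumptions.

def multiple_r(r1: float, r2: float, r12: float) -> float:
    """Multiple correlation of a criterion with two predictors.

    r1, r2: validity of each predictor; r12: correlation between predictors.
    """
    if abs(r12) >= 1.0:
        raise ValueError("predictors must not be perfectly correlated")
    r_squared = (r1**2 + r2**2 - 2 * r1 * r2 * r12) / (1 - r12**2)
    return math.sqrt(r_squared)


def corrected_for_criterion(r_xy: float, criterion_reliability: float) -> float:
    """Validity expected if the criterion were measured without error."""
    if not 0.0 < criterion_reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    return r_xy / math.sqrt(criterion_reliability)


# Table 42: pilot validities .40 and .28; the two tests correlate .27.
combined = multiple_r(0.40, 0.28, 0.27)
print(f"best single test: .40, two tests combined: {combined:.2f}")

# Hypothetical criterion reliability of .64 (an assumption for illustration).
print(f"corrected validity: {corrected_for_criterion(0.40, 0.64):.2f}")
```

Because the two tests correlate only .27 with each other, the second adds to the prediction from the first; a more highly correlated pair would add less.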

5. Which of the validity coefficients in Table 42 appear to support the principle 
that tests which require the same motor skills as the job are superior predictors? 

6. What hypotheses can be advanced to explain the low correlation of the steadi- 
ness test with bombardier and pilot success, since it is obvious that steadiness is 
needed in both jobs? 

Because of the number and diversity of manual tests, there is little value in 
a summary table of selected tests. The reader is referred to Bennett and 
Cruikshank, who review over 100 tests of manual and mechanical abilities. 
They arrive at the following general conclusions (3, pp. 4-5):

1. For vocational selection the motor tests should measure as nearly as pos- 
sible the movements required on the job. Thus for routine motor tasks the 
miniature test, built to duplicate the movements required by the per- 
former on the regular test, is probably best. 

2. The test should, in most instances, involve a sufficiently large unit of
work to measure fatigue and endurance rather than momentary capacity. 

3. The test should be administered at the time of applicant selection. As yet
no adequate study has been made which would lead us to believe that 
long-range prediction for motor tasks is entirely satisfactory. 

4. For some skilled motor performances, requiring a diversity of motor operations,
some of the more complex manual tasks may probably be used
with good success.

5. Vocational guidance (not selection) on the basis of motor skill alone is 
quite deplorable, except in the case of individuals who have gross inca- 
pacitating motor disabilities which may prove a deciding factor in voca- 
tional selection. 

6. Sufficiently inclusive norms have not been obtained for tests which are 
probably very useful. This means that the user of tests for applicant selection
should be constantly checking on the standards for the job in question.

If a test is supposed to measure aptitude, it is important to obtain a stable 
measure, characteristic of the person over a period of time. Unstable scores 
are obtained if we apply a psychomotor test without giving the subject a 
chance to learn the task in preliminary trials. On complicated testing devices, 
a subject cannot show his full ability until he has become familiar with the 
reaction required. Wulfeck gathered data with the Two-Hand Coordination 
Test. Ten subjects, selected on the basis of high or low initial ability, were 
given 10 trials on the equipment each day for 10 days. Subject A, who spent 
81.5 percent of his time on target in six preliminary trials, learned very 
rapidly, reaching 99 percent “on target” on the third day. Subject C, beginning
at 77.1 percent on target, learned steadily, reaching 99 percent on the
fourth day. Subject Z, with the very low score of 52 percent before practice,
rose even more rapidly, passing subject C on the second day and maintaining
a 100 percent score on the fourth day and after. All other members of the
“poor” group reached 96 percent or better on the fourth day. Since both the 
“good” and “poor” subjects showed nearly perfect ability on the task after 
becoming familiar with it, the first trials apparently measured their facility 
in adjusting to the new task, rather than their motor coordination. Wulfeck 
recommends that in such motor tests a minimum of 20 trials be used, the 
score on the last 5 trials being used as an index of ability (35).
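
Wulfeck's scoring recommendation is simple enough to state as a short routine. The following Python sketch assumes one score per trial, listed in order of administration; the function name and the practice record are hypothetical, and only the figures of 20 trials and the last 5 trials come from the text.

```python
# Sketch of the scoring rule attributed to Wulfeck (35): give at least
# 20 trials and average the last 5. Names and sample data are illustrative.

def ability_index(trial_scores, min_trials=20, scored_trials=5):
    """Mean of the final `scored_trials` scores, after at least `min_trials` trials."""
    scores = list(trial_scores)
    if len(scores) < min_trials:
        raise ValueError(f"need at least {min_trials} trials, got {len(scores)}")
    final_scores = scores[-scored_trials:]
    return sum(final_scores) / len(final_scores)


# Hypothetical record: percent on target rising by 2 points per trial from 50.
practice_record = [50 + 2 * trial for trial in range(20)]
print(ability_index(practice_record))
```

Scored this way, the early trials, which reflect adjustment to the new task, do not enter the index at all.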

7. Experimenters wish to study the effect of vitamin lack on motor performance.
They plan to test a group, then alter the diet, and test again after some time. 
Would it be desirable to offer training on the tests before the first measure- 
ment? 

8. a. Under what circumstances would measuring fatigue or endurance (2 in the
numbered paragraphs above) lower the validity of a psychomotor test?
b. Under what circumstances would it be desirable to measure endurance?

9. Psychomotor tests of school children and adolescents have not been found 
good predictors of later vocational success in mechanical work. What explanations
can be advanced for this (3, pp. 10-11)?

10. In view of the low correlations between psychomotor tests, can such tests be 
helpful for guiding junior-high-school boys who are considering learning a 
trade? 

11. Psychologists are now seeking pencil-paper tests of psychomotor abilities to
replace expensive apparatus tests. What types of apparatus test will be easiest
and which most difficult to replace?

Psychomotor tests may be used fairly only to compare people at the same 
point on the learning curve. What a man does when he first tries the task 
does not always correlate highly with scores he earns later in the practice 
series (10, 26, 27). Scores tend to be more reliable after he has had practice 
on the task sufficient to remove the element of problem solving (26, 27). 
For most jobs, it is more important to know what coordination or speed the 
applicant can display after training than how fast he can adapt to a new 
task. Only where cost of training is high does the employer worry about
speed of learning. The purpose in selection is to get workers who will pro- 
duce well at the end of training. 

In line with the evidence that good predictors resemble the job, J. L. Otis 
found that tests which predict quality on a job well may be poor predictors 
of speed (19). Correlations of predictive tests for sewing-machine operators 
are shown in Table 43, with both speed and quality criteria. Otis points out 
that workers suitable for a shop stressing quality may lack aptitudes needed 
in a shop seeking high volume of production. The user of psychomotor tests 
must have clearly in mind the nature of the job he wishes to predict. There is 
no all-round superior worker. (On some jobs, even intelligence is a liability
because brighter workers dislike routine.)


Table 43. Prediction of Quality and Quantity of Work of Sewing-Machine Operators (19)

                                                Correlation with      Correlation with
                                                Quality Criterion     Speed Criterion
Test                                            (N = 52)              (N = 52)
--------------------------------------------------------------------------------------
Minnesota Clerical, Names                       .36                   .08
Minnesota Clerical, Number                      .26                   .22
Poppelreuter Tracing (time score)               -.31                  .45
Poppelreuter Weaving                            .27                   .21
Paper folding                                   .30                   -.10
Minnesota Spatial Relations (time)              .24                   .28
Minnesota Paper Form Board                      .32                   .17
O'Connor Tweezer Dexterity                      .07                   .46
O'Connor Finger Dexterity                       .20                   .27
Minnesota Manipulation                          .08                   .31
Otis Self-Administering (IQ)                    .17                   .11
Tests with correlations in boldface, combined   .57                   .64

MECHANICAL KNOWLEDGE 
Information 

Performance tests deal mostly with manipulation and motor speed. Tests
of mechanical knowledge test information and problem-solving ability. They 
may be regarded as achievement tests, if the subject has learned to do the 
problem through experience and training. When such tests are used to
classify applicants, it is assumed that those who have no mechanical background
failed to acquire it because of lack of interest. A man making a
poor score is therefore considered a probable poor job risk, even if his trouble
is lack of background rather than poor capacity. Stenquist found it helpful,
in guiding adolescent boys, to determine their knowledge of tools. A set of
pictures of tools and machines is presented in his test, together with questions
about their purpose or operation.

12. What assumption is made about the experience of junior-high-school boys, in 
using the Stenquist test as a measure for vocational guidance? What additional 
data should be used in deciding whether a boy will do well in shop work? 


Fig. 52. Scores on a trade test for engine-lathe operators (6, p. 493).

The U.S. Employment Service has done much to develop trade tests, for
use where knowledge about only one job is important. These tests measure
knowledge about the job and are essentially short interviews. Many men who
claim experience in a trade fail on the questions. During wartime, such a
screen used in an employment center eliminated men who might otherwise
have been shipped across the country to a plant where skilled men were
needed. Trade questions are selected by analysis of job processes and tools.
Questions are tested to eliminate any which would be unfair because of
regional differences in methods of work or vocabulary. Three criterion
groups are tested: expert workers, beginners in the trade, and workers in
closely related trades. The items which discriminate are retained. Items from
several tests are (28, pp. 45-47):

(Carpenter) What do you mean by a “shore” in carpentry? Ans. Upright brace.
(Plumber) What are the two most commonly used methods of testing plumbing
systems? Ans. Water, smoke, peppermint, air (any two).
(Asbestos worker) In stitching canvas covering over pipes, where is the seam run?
Ans. Out of sight, back or top of pipe (either).

Fig. 53. Mechanical comprehension items from Differential Aptitude Tests.
Questions shown: “Which gear turns slower?”; “At which point was the ball going
faster?”; “In which container will the ice cream stay hard longer?” (each with
the direction “If equal, mark C.”).

A good trade test discriminates between novices, apprentices, journeymen,
and experts. In Figure 52 we see how a test of engine-lathe operators func- 
tions. Such a distribution of scores permits one to classify a new applicant 
with little error; a score of 22 almost certainly indicates a journeyman. 
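
The item-selection step described earlier, in which questions are tried on three criterion groups and only those that discriminate are retained, might be sketched as follows. The retention rule (experts must pass an item at a rate at least .30 higher than either comparison group), the function name, and the pass rates are all hypothetical; the text does not state the rule the Employment Service actually used.

```python
# Sketch: retaining trade-test items that separate experts from the two
# comparison groups. The margin of .30 and all pass rates are hypothetical.

def discriminating_items(pass_rates, min_gap=0.30):
    """Return items whose expert pass rate exceeds both other groups by `min_gap`.

    pass_rates maps item -> (experts, beginners, related-trade workers),
    each value being the proportion of that group answering correctly.
    """
    retained = []
    for item, (experts, beginners, related) in pass_rates.items():
        if experts - max(beginners, related) >= min_gap:
            retained.append(item)
    return retained


tryout = {
    "Q1": (0.90, 0.30, 0.40),  # experts far ahead of both groups
    "Q2": (0.85, 0.80, 0.75),  # nearly everyone passes
    "Q3": (0.70, 0.20, 0.45),  # related-trade workers often pass
}
print(discriminating_items(tryout))
```

An item like Q2 is answered by almost everyone and so tells the interviewer nothing about trade experience; an item like Q3 fails to separate the expert from a man in a neighboring trade.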

Comprehension and Reasoning 

The Bennett Test of Mechanical Comprehension has been widely used. It
has three forms, and, in addition, items have been adapted for military
classification tests and for other aptitude tests. Bennett claims that “the problems
have been selected in such a way as to minimize the effect of training
and formal knowledge” (3, p. 39). Emphasis is on application of principles
rather than specific verbal learnings. Illustrative mechanical comprehension 
items are shown in Figure 53. 

The Stenquist Assembly test is a performance test of knowledge or insight 
into mechanical problems. The tests, which range from third grade to adult 
levels, present simple mechanical devices: clothespin, push button, hinge, 
caster, etc. The parts of these devices are to be put together by the subject. 
The test was improved and issued as the Minnesota Assembly Test (Fig. 
46), which uses a wider variety of objects. Table 44 reports validity coefficients
for tests of comprehension and knowledge.


Table 44. Empirical Validation of Tests of Mechanical Comprehension

Test and Publisher              Sample                  Criterion                   Correlation   Remarks and Reference
------------------------------------------------------------------------------------------------------------------------------------------
Cox Mechanical Aptitude:        Engineering students    Engineering grades          .41 to .42    Subject must explain action of hidden
  Models (N.I.I.P.)                                                                                 mechanism, seeing only its effects (5).
Stenquist Mechanical Aptitude   Engineering students    Engineering grades          .43           Stenquist Picture I and Assembly give
  (Picture) II (World Book)                                                                         negligible r's (13).
Bennett Mechanical              Machine-tool operators  Ratings of proficiency      .64           (4)
  Comprehension (Psych. Corp.)
Stenquist Assembly I            Junior-high-school boys Ratings by shop teacher     .77           (29). Lower r's found by later
  (Stoelting)                                                                                       investigators (20).
Stenquist Picture II            Junior-high-school boys Ratings by shop teacher     .64           (29). Lower r's in other studies (20).
Minnesota Assembly (Marietta)   Junior-high-school boys Carefully scored quality of .55           Criterion correlates only .21 with IQ
                                                          shop products                             (20, p. 204).


13. From a study of the items, what would you say that tests of the Bennett type 
measure? 

14. What features of the Bennett test would improve or impair its usefulness for 
predicting college engineering grades? For predicting success of naval recruits
in learning maintenance of Diesel engines?

15. What does the Minnesota Assembly Test measure? Is it a test of innate ability 
or of knowledge? 

16. Tiffin speaks of the Minnesota Assembly Test as follows (30, pp. 110-111):

“One criticism that has been leveled against this type of test is that per- 
sons tested are likely to vary in their familiarity with the various items, and 
hence the test is likely to measure this familiarity rather than mechanical 
ability. In spite of the validity of this criticism from a theoretical viewpoint, 
the fact remains that, in practical situations, the test has shown itself to be 
a serviceable measuring scale. This fact, rather than a theoretical criticism, 
is the basis upon which it should be evaluated.”³

Do such theoretical criticisms have any importance for psychologists engaged
in selection of employees?

³ Reproduced by permission of Prentice-Hall, Inc., from Industrial Psychology, Second
Edition, by Joseph Tiffin. Copyright, 1942, 1947, by Prentice-Hall, Inc.

17. a. The Purdue Assembly test is designed to include mechanisms using each
important mechanical device: gears, levers, rack-and-pinion, etc. Does such
a test assume that mechanical aptitude or comprehension is a single general
ability, or that it is a group of specific abilities?
b. If the latter theory is true, what implications does it have for selecting an
employee for training in a skill such as watch repairing?

18. Boys surpass girls on the Bennett test (2). How may this finding be explained? 

19. Mellenbruch reports validity coefficients ranging from .50 to .60 for his Mechanical
Aptitude Test, which is somewhat like Stenquist’s picture test. The
criteria used are teacher’s ranking of engineering drawing trainees (women), 
experience in mechanical activities, Stenquist scores, and scores on the Air 
Force Mechanical Information Test. What other validity studies are needed 
to support the recommendation that those scoring low on the test should not 
be hired for mechanical work or should be placed only in routine mechanical 
jobs? 


ARTISTIC ABILITIES 

Since general mental ability is no guarantee of success in artistic en- 
deavors, much study has been directed toward identifying talent in art. A 
variety of tests of artistic aptitude have been proposed, which illustrate how 
special abilities are investigated. If one is to measure “aptitude for art” he 
must analyze the work of the artist to determine what that calls for. Having
made such an analysis, he must design tests for the abilities involved. The 
tests must not depend on special training, since this would give an untrue 
picture of a person with little talent who had begun artistic training. The 
tests, of course, are refined and validated in the usual ways, with the added 
problem that there is no completely adequate criterion for judging superi- 
ority as an artist. 

Different analysts have considered different abilities. The Meier Test of 
Art Judgment, the Lewerenz Test of Fundamental Abilities in Visual Art, 
and the Horn Art Aptitude Inventory are among the promising preliminary 
attempts in this field.⁴

The Meier test, as its name implies, studies the extent to which the subject 
is a good judge of aesthetic qualities. As such, it measures taste rather than 
ability in using art media. Pairs of pictures are presented in which some one 
characteristic has been altered markedly in composition, shading, or tech- 
nique (cf. Fig. 54). The subject decides which of the two presentations is 
the more pleasing; if he agrees with “experts” who have taken the test, he 
gets a high score. Two further tests are planned, to measure “creative imagi- 
nation” and “aesthetic perception.” 

The Lewerenz test is a battery of measures to indicate likelihood of success
in drawing and painting. Among the aspects of artistic ability measured
are preference for designs, drawing a sketch to fit a pattern, locating proper
positions of shadows, art vocabulary, reproducing a form (vase) from memory,
correction of perspective, and color matching.

Fig. 54. Item from the Meier Test of Art Judgment.

⁴ Publishers: Univ. of Iowa (Bureau of Educ. Research and Service), Calif. Test Bureau,
Rochester Athenaeum, respectively.

Creative and artistic abilities at a fairly high level are tested by the Horn 
Art Aptitude Inventory, designed for estimating the probable success of applicants
to art school. Most of these persons will have had previous art training.
The test has three parts: a “scribble” exercise to build confidence, a
“doodle” exercise calling for simple compositions, and an imagery test. In
the imagery test, the subject is given several cards, each bearing a pattern of 
lines. Around this set of lines, he is to sketch a picture (Fig. 55). The pic- 
tures are judged by art instructors as to imagination and technical drawing 
quality. Using careful scoring directions, competent judges can attain a cor- 
relation of .86 between independent scorings. A correlation of .66 is reported 
between freshman grades in art and scores at the beginning of the year (14). 
Further validation is in progress. The currently popular theory that ability 
and personality are in truth inseparable is well demonstrated by this test.
Hellersberg has taken the Horn test, aptitude measure though it is, and used 
it to diagnose personality (12). Her method is to interpret the test as a projective
technique (Chap. 20).

The prediction of artistic success is only in the beginning stages. Reliabilities
and validities of tests are far too low to permit final judgment of talent
from test scores. Two studies have pointed the way for further investigations
in this field. Meier, reviewing extensive biographical and experimental
studies, concluded that artistic talent lay in six traits: fine eye and hand
coordination (craftsman skill), energy and concentration, intelligence (or
some factors within it), perceptual facility (keenness of observation), creative
imagination, and aesthetic judgment (16). It has long been thought
that the artist differs from other intelligent people in personality, perhaps
more than in “aptitude.” Roe has opened this area to testing with the Rorschach
test (Chap. 20). She found that there is no one artistic personality, nor
do successful artists seem to have “creative” personalities. It is true, however,
that the style and character of pictures an artist produces are in many cases
related to his personality structure (24).

Fig. 55. Specimen item from Horn test of art aptitude, showing stimulus lines
and two drawings based on them.

20. What criterion would the Meier test be expected to predict better than the 
Horn test would, and vice versa? 

21. What aspects of art ability if any do not appear to be measured by any of the 
tests described? 

22. How could one obtain evidence whether the Meier test reflects native capacity 
rather than experience with art?

23. In view of the limited reliability and validity of art aptitude tests, in what way
may they be used for guidance in the public schools at present? 

SENSORY ABILITIES 

Tests of simple sensory judgments and acuity have played a large part in
experimental psychology and the development of test methods. From a prac- 
tical point of view, only tests of vision and hearing are widely important. 
Only a cursory survey of the major types of vision and hearing tests can be 
offered here. 


Vision 

Because of the importance of visual skills in reading and industry, much 
research on vision has been done by medical and psychological investigators. 
The most refined measurement procedures are used by oculists, but many 
simpler techniques are used in school and industry. The range of visual abilities
measured is impressive, and directs attention to the fact that most seemingly
obvious psychological variables break down, when scrutinized, into
many specific elements. 

Accommodation, the ability to form a sharp image of the object seen, is 
tested most simply by the familiar Snellen and broken-circle charts. Con- 
vergence, the ability to use the two eyes together, is tested in several ways, 
the most common of which employs a binocular viewer. Special test cards
permit the tester to determine whether the eyes are used together. These 
variables in turn are frequently subdivided. Near-point accommodation may 
not correspond to far-point vision. The two eyes must be tested separately, 
since they are often different. The eye muscles may work skillfully in hori- 
zontal tracking but coordinate poorly in vertical movements. And so on. The 
Telebinocular, which has been popular as a screening test for children, re- 
ports a total of 11 different scores. The Orthorater, designed for military and 
industrial use, analyzes vision into 12 separate scores. Both the latter tests use 
stereoscopic lenses to measure depth perception and coordination.® 

Color vision has been the subject of many tests. While original interest in 
color vision stemmed from theoretical problems, tests have been important in 
military selection and in some industries. The Ishihara test has been widely 
used; it consists of plates in which greenish-yellow numbers are embedded in 
a yellow-orange field. A person who cannot distinguish the green and orange 
tints sees both areas as uniformly yellowish and cannot identify the num- 
bers. Furthermore, a pattern of blue-green spots in red-orange background 
will catch the eye of the red-green-blind person, and cause him to report a 
different number, whereas the totally color-blind person will report no num-
ber at all. Various refinements of this technique, known as the pseudo-iso-
chromatic test, have been developed. Other techniques require sorting or
matching of colored wools or tiles. 


® Publishers: Keystone View Co., Bausch and Lomb, respectively. 
24. a. What criteria might be used to determine whether the Telebinocular test
is valid? 

b. What criterion would be most suitable to determine the usefulness of the 
test for the routine screening of school children to locate visual defects? 
(Cf. 18.) 


Hearing 

Ability to hear is tested both in terms of general lack of deafness and in
terms of a wide variety of specific abilities such as pitch discrimination. The
common method for testing general acuity is the audiometer, an electronic
instrument which produces tones of any desired frequency and loudness. The
faintest tone the person can hear at each frequency is determined. This test
again illustrates how abilities can be broken down into finer subdivisions.
The audiogram in Figure 56 is not an unusual type. As the heavy line shows,
this subject has normal hearing in the frequency range near 1000 cycles. He
probably hears a radio or speech as loud as does the person next to him. But
his hearing loss in the high frequencies is severe, even though he may think
of himself as having perfect hearing. He will lose some of the high-frequency
sounds that help make consonants intelligible, he will miss some of the
“brilliance” of orchestral music, and he will be a failure at some military
duties where detection of high-frequency sounds is important.

Fig. 56. Audiogram of an adult male.
(Courtesy J. Donald Harris)

Good hearing alone, however, does not mean that one will be superior in 
all hearing tasks. Military research established a need for special tests of 
ability to hear through noise. Communication in airplanes requires under-
standing of speech in the presence of a high noise level. Some persons with
good general hearing could not do this. 

Seashore’s so-called Measures of Musical Talent (RCA) measure sep- 
arately sensitivity to pitch, intensity, time, timbre, tonal memory, and rhythm. 
Pitch is tested by a series of paired tones, the subject judging whether the 
second tone is higher or lower than the first. Similar devices are used to
determine the smallest difference to which the subject can react correctly in
each aspect of hearing. The pitch, tonal memory, and intensity scales, which
have satisfactory reliabilities, are good examples of tests having high logical
validity. The tests have been used extensively to eliminate from musical
training those whose scores were exceptionally low, or to advise students in
the choice of musical instruments according to their pattern of aptitudes.
Several Seashore profiles are shown in Figure 57.

Fig. 57. Profiles from the Seashore Measures of Musical Talent. Profiles are
rated A, safe; B, probable; C, possible; D, doubtful; E, discouraged, in terms
of prognosis of success in the Eastman School of Music. Letters at the left of
the profile signify the tests given: pitch, intensity, time, consonance, memory,
and imagery.

25. What conclusion can be drawn about an adolescent who does exceptionally 
well throughout the Seashore battery? Should such a person plan on a musical 
career? 

26. An 11-year-old girl takes the Seashore test and does badly on pitch and
rhythm. What assumptions are being made if the psychologist advises her
mother that she should not take piano lessons?

27. What aspects of musical ability are most likely to account for the difference 
between a gifted conductor and a merely adequate conductor? Can such abil- 
ities be tested? 


DIAGNOSTIC TEST BATTERIES 

The desire to use a few dependable tests for guidance in a wide range of 
situations has led to the development of several diagnostic batteries for meas- 
uring abilities. A diagnostic battery is a set of tests having comparable norms 
and usually organized in an easily administered series with uniform style. 
Such batteries may be contrasted to collections of tests prepared by different 
authors and standardized on different populations, on which counselors have 
previously been forced to rely. 

Table 45. Batteries for Differential Testing of Abilities

Differential Aptitude Tests (Psych. Corp.)
  Age range: Grades 8-12
  Scores reported: Verbal reasoning, number, abstract reasoning, space
    relations, mechanical reasoning, clerical, spelling, language usage
  Special features: Designed for guidance. Norms based on large numbers of
    representative pupils

Guilford-Zimmerman Aptitude Survey (Sheridan Supply Co.)
  Age range: Adolescents and adults
  Scores reported: Verbal reasoning, numerical, perceptual speed, spatial
    orientation, spatial visualization, mechanical knowledge
  Special features: Adapted from Air Force tests. Authors hope to combine
    pure tests to predict various activities. Thirteen more tests planned

Primary Mental Abilities (Sci. Res. Assoc.)
  Age range: 5-6; 7-11; 11-17 (a)
  Scores reported: Verbal, word fluency, number, space, reasoning, memory
  Special features: Intended for research on mental abilities

Yale Aptitude Battery (Yale Univ.)
  Age range: College freshmen
  Scores reported: Verbal, linguistic, verbal reasoning, quantitative
    reasoning, mathematical aptitude, spatial visualization, mechanical
    ingenuity
  Special features: For college guidance. Various tests correlate high with
    certain courses, low with others

USES General Aptitude Test Battery (USES)
  Age range: Adults
  Scores reported: General intelligence, V, N, S, P, clerical perception,
    aiming, motor speed, finger dexterity, manual dexterity
  Special features: Designed for placement of workers. Not released to
    testers generally. Based on extensive validation research with workers

(a) Various sets of the tests for adolescents have been published. The
principal difference is in the length and reliability of the measures.

By permission from Psychology of Music, by Carl E. Seashore. Copyrighted, 1938,
by McGraw-Hill Book Co., Inc.
Five rather different batteries are described in Table 45. The tests in some
sets are selected for a practical purpose, each test being supposed to predict 
certain types of performance. Other batteries, constructed along factorial 
lines, depend more on test purity and logical validity than upon demon- 
strated empirical validity. In due time we may anticipate that pure, psycho- 
logically valid tests can be developed which will have empirical validity 
adequate for vocational and educational guidance. 

There are too many uncorrelated special abilities for us to measure all the 
significant ones with a single battery. One encouraging fact is the finding of 
the U.S. Employment Service that jobs fall into “families” having similar re- 
quirements. Within jobs of broadly similar type and occupational level, per- 
sons qualified for one job also can succeed in other jobs. This means that 
tests might conceivably be validated on job families rather than single jobs. 
Unfortunately, even the best such attempt to date, a general battery for 
clerical workers, is quite inadequate for guidance. Correlations with success 
on different jobs are all positive, but often much too low. Correlations with 
criteria include .64, comptometer operators; .39-.48, coding clerks; .20, hand 
transcribers; .12, bookkeeping-machine operators (28, p. 154). Evidently 
each job has large specific elements. Conditions under which correlations 
under .50 have practical value are discussed in the next chapter. 
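
As a rough illustration of why several of these coefficients are called too low, the sketch below squares each reported correlation to give the proportion of criterion variance linearly associated with the battery. The coefficients are those quoted above; the variable names and the choice of r squared as a yardstick are assumptions made for this illustration, and coding clerks are entered at both ends of the reported range.

```python
# Validity coefficients for the clerical battery as quoted in the text (28, p. 154).
# Names are illustrative only.
reported_validities = {
    "comptometer operators": 0.64,
    "coding clerks (low end)": 0.39,
    "coding clerks (high end)": 0.48,
    "hand transcribers": 0.20,
    "bookkeeping-machine operators": 0.12,
}


def variance_accounted_for(r):
    """Proportion of criterion variance linearly associated with the test."""
    return r * r


def summarize(validities):
    """Return {job: r squared, rounded to three places}."""
    return {job: round(variance_accounted_for(r), 3) for job, r in validities.items()}


if __name__ == "__main__":
    for job, share in summarize(reported_validities).items():
        print(f"{job:32s} r^2 = {share:.3f}")
```

Variance accounted for is only one way to judge a coefficient, and it is a severe one; it is offered here as a quick check, not as the standard by which a selection test should finally be judged.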

28. In the extensive Minnesota investigations of mechanical ability, Paterson and 
his co-workers assumed a “theory of unique traits” which they describe as fol-
lows (20, p. 14):

“According to this theory, the difference between individuals of two occu- 
pational classes can be expressed in terms of quantitative differences in a 
few traits. The difference between a machine tender and a farm hand, for 
example, is a matter of differences in a few particulars such as physical 
strength, ability to resist monotony, scope of attention, mechanical aptitude, 
and intelligence. Similarly, these and other capacities may serve to distin- 
guish the machine tender from the tool maker, the tool maker from the 
foreman. . . .” 

Under the theory, one would identify the limited list of unitary traits, make a test
for each, and analyze each job to see which traits it demands. How well does 
this theory hold up, in terms of present knowledge about aptitudes? 
CHAPTER 11


Problems in Prediction: Prognostic Tests 

PROCEDURES IN PREDICTION 

When we must predict success in task X, we would like to look in a pub- 
lisher’s catalog, find a test labeled “Test of Aptitude for Task X,” and begin 
using that test for selection. Unfortunately, the procedure required to estab-
lish a prediction program is much more complicated, because of faulty label-
ing of tests and the difficulties inherent in practical prediction. As previous
chapters have shown, tests of intelligence measure different things, tests of
mechanical abilities vary widely in content and psychological meaning, and
even narrow subdivisions of ability may be represented by several tests
which do not correlate highly with one another. Not all tests claimed to meas-
ure aptitude in a given area have been tested for empirical validity, and no
one knows how well any test will work in a particular practical situation until
it is tried out.

The counselor, employment manager, or administrator setting up standards 
for screening or providing for individual differences can accept no test on
face value. As will be seen, he cannot accept a test solely on the basis of re-
search conducted elsewhere. Sooner or later, nearly every test worker must
make validity studies to determine whether his prediction methods are work-
ing. While the practicing tester may limit his studies to relatively simple
checking, it is important to know the full procedure for validation research, 
since this establishes the basic logic of any study of prediction.^ 

To establish a testing program for prediction one chooses a number of 
tests for trial, determines their effectiveness experimentally, and devises a
plan for using test scores in making decisions. One procedure relies on crude
trial and error; the experimenter assembles a “shotgun” battery of all kinds
of tests in the hope that one or more of them will prove effective. This method
was more common in earlier years and is declining as we understand better
why some tests are valid and others are not. Psychologists developing test 
batteries for guidance or selection today devote considerable thought to the 
characteristics of the job and the establishment of adequate criteria, as well 
as to the search for promising tests. 

The stages in prediction research are as follows: 

^ The ensuing discussion is centered around employee selection in industry, primarily
to simplify the explanation. The reader can paraphrase the procedures to apply to mili- 
tary, school, or clinical problems. 
1. Job analysis, to determine what characteristics make for success or
failure. 

2. Choice of possibly useful tests for trial. 

3. Administration of tests to an experimental group of workers. 

4. Collection of criterion data showing how the experimental group of 
workers succeeded on the job. 

5. Analysis of relation between test score and success on job, and in- 
stallation of most effective selection plan. 
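
One simple way to carry out the last stage is to tabulate, for several possible critical scores, how many workers would have been accepted and what proportion of them later succeeded. The sketch below shows such a tabulation under stated assumptions: the function names, the pass/fail form of the criterion, and the sample records are all hypothetical and are not taken from any study cited in this chapter.

```python
def success_rate_at_or_above(records, cutoff):
    """records: list of (test_score, succeeded) pairs from the experimental group.

    Returns (number accepted at this cutoff, proportion of them who succeeded).
    The proportion is None when nobody reaches the cutoff.
    """
    outcomes = [succeeded for score, succeeded in records if score >= cutoff]
    if not outcomes:
        return 0, None
    return len(outcomes), sum(outcomes) / len(outcomes)


def expectancy_table(records, cutoffs):
    """Map each candidate critical score to its (accepted, success rate) pair."""
    return {cutoff: success_rate_at_or_above(records, cutoff) for cutoff in cutoffs}


if __name__ == "__main__":
    # Hypothetical experimental group: test score and whether the worker succeeded.
    trial_group = [
        (12, False), (15, False), (18, True), (21, False),
        (24, True), (27, True), (30, True), (33, True),
    ]
    for cutoff, (accepted, rate) in expectancy_table(trial_group, [10, 20, 30]).items():
        print(f"critical score {cutoff}: accepted {accepted}, success rate {rate}")
```

A table of this kind is meaningful only when the experimental group was not itself screened on the test, a point taken up under Experimental Trial.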

Job Analysis 

The first stage in establishing a prediction or selection program is to an-
alyze the job to be predicted. This analysis attempts to determine what abili-
ties and habits contribute to or limit success in the job. No machine-like pro-
cedure of checking off one by one all possible factors has ever been found
successful. Instead, the psychologist studies the task with whatever insight
and psychological knowledge he can muster. Analysis is in large measure an
art. In this stage, one is looking for hunches worth pursuing by formal re-
search methods. No prediction research can establish relations between tests
and performance unless the experimenter is capable of forecasting what tests
will be worth trying. The great number and variety of tests discourages a
blind trial-and-error procedure.

One frequent approach to job analysis is to compare good and poor em-
ployees. Simple studies of such men often permit one to define the essential
difference between good and bad performers. Such evidence does not, how-
ever, directly justify using tests of this ability for selection, since the good
workers might have acquired their superior characteristics on the job after
hiring.

In order to make a successful analysis, one must first of all have wide back-
ground in psychology. Understanding of motivation, motor habits and the
organization of abilities, and knowledge of the multitude of tests now avail-
able for use are required. Detailed motion analysis will reveal what dexteri-
ties or coordinations are important. Analysis of the stimuli to which a worker
responds may suggest that certain specific perceptual or sensory abilities are
needed for success. Study of workers in training is helpful, since their dif-
ficulties in learning may show what aptitude is needed to avoid failure.
Studies of prediction for other jobs are helpful, since they draw attention to
tests worth trying, and sometimes suggest that certain tests may be elim-
inated without further trial. No routine or stereotyped approach is likely to
be successful, however. The analyst must take off from the experience of oth-
ers, but unless he brings in new hypotheses he is unlikely to find a better
method of predicting than his predecessors. The history of testing shows
many instances in which psychologists, merely trying over the same tests
others had used before them, overlooked major aspects of the job which
were not tested by the familiar instruments.
[Figure 58: Worker Characteristics Form of the War Manpower Commission, Bureau
of Manpower Utilization, filled out as an example for the job title "louver
cleaner." The rating grid is not reproduced here. The instructions printed on
the form read:]

Indicate the amount of each characteristic required of the worker in order to
do the job satisfactorily by putting an X in the appropriate column. Following
are the definitions of each level:

0 - The characteristic is not required for satisfactory performance of the job.

C - A medium to very low degree of the characteristic is required in some
element or elements of the job.

B - An above-average degree of the characteristic is required, either in
numerous elements of the job or in the major or most skilled element.

A - A very high degree of the characteristic is required in some element of
the job.

When in doubt between A and B, rate B; when in doubt between B and C, rate B;
when in doubt between C and 0, rate C. If some characteristic not on this list
is required, write it in, rate it, and define it briefly at the bottom of the
form.

Fig. 58. The form has been filled out for the occupation of louver cleaner. 
A louver cleaner cleans and adjusts electrolytic cells used in magnesium proc- 
essing. (Courtesy U.S. Employment Service.) 

A job analysis should be highly specific. One should not state that success-
ful workers have “mechanical ability”; one should instead define the ability
as “knowledge and ability to apply principles of gears,” or “speed in routine
two-hand manipulation, not involving much finger dexterity or thinking.”
Such breakdowns permit one to try the most appropriate test. The analysis
should include as many characteristics as can possibly be detected in the 
worker. It is certain that some of these will be minor, and may not be im- 
portant enough to test experimentally. Having a full analysis, however, per- 
mits the investigator at any time to reconsider his basis for prediction and 
introduce some of these additional elements. The analysis should not look 
only for “mental abilities.” Instead, it should range over the entire field of 
abilities, habits, personality characteristics and interests, previous experi-
ences, knowledge, physical characteristics, and so on.

The worker characteristics form of the U.S. Employment Service is de- 
signed to record the characteristics most often found in job analysis. Nat- 
urally, additional job specifications must be added for individual jobs. The 
form shown in Figure 58 has been filled out for the occupation, louver 
cleaner. This rating of characteristics is based on the judgment of a field 
analyst and is subject to experimental validation through the use of tests of 
the abilities rated A or B. Although traits are listed on the form in generalized
terms, they are defined in the analyst’s manual, e.g., “Dexterity of fingers:
ability to manipulate small objects by rapid and/or accurate movements of 
the fingers” (20, pp. 175-183). 

1. Which of the characteristics listed in Figure 58 may actually be composites 
including several abilities, in terms of the findings reported in previous chapters? 

2. Prepare a list of the probable factors composing aptitude for one of the follow- 
ing jobs: making pie dough, operating a calculator of a particular type, taxi 
driving, or school teaching of some one type. 

3. For many jobs requiring long training, e.g., physiotherapists, it is undesirable
to take girls into training who will probably marry and drop out. What char- 
acteristics might distinguish between probable marriers and non-marriers? 

Choice of Tests for Trial 

Knowing the characteristics important in a job, the investigator must then 
find tests to measure each. He must make a choice between seeking one test
which is a composite of the job requirements and seeking a group of tests,
each of which is a pure and independent measure of one of the character-
istics. The former method, which usually leads to tests of the job-replica type,
is suitable only if the investigator is willing to go to the trouble of designing
a new test for the job. The composite test has distinct disadvantages:

1. In employee placement or in guidance it is more economical to use a 
few tests which give information about many jobs than to have a separate test 
for each job. 

2. If one function of testing is to place the worker in the most suitable job 
of several available, a test which includes characteristics of more than one 
job will be non-distinguishing. 

3. Work samples must be revised, restandardized, and revalidated when
any change in the nature of the job is made. A battery of tests may often be
revised to fit minor changes in the job by altering or adding only one test. 

Despite these objections, the complex work sample usually is better than the 
pure test, because well-designed job replicas give the best validity coeffi- 
cients. 

Assuming that the investigator decides upon using many tests, each for a 
particular function, he must then choose between using available tests and 
constructing new ones. Insofar as the job under consideration uses abilities
or traits already measured in published tests, such tests should be tried. 
Naturally, not every test with a relevant name will be suitable; the investi- 
gator must consider the difficulty of the test, its appropriateness to the intelli- 
gence and education of his subjects, and the like. If the job calls for an abil- 
ity only partly similar to available aptitude tests, it is more desirable to make 
a new test to measure the ability the job requires than to obtain a pale dis-
torted image of it from a less direct measure. Without condemning the useful
Bennett Test of Mechanical Comprehension, we can use it to illustrate this
point. The items measure some general factor, but there are also group and 
unique factors among its items, which are drawn from all portions of physics 
and mechanics. To select men for advanced electrical training, background 
and comprehension are significant; but it is probable that Bennett’s items 
dealing with electricity will give better correlations than his items on forces, 
motion, and buoyancy. Inclusion of the latter items might, in fact, dilute the 
test so that it will fail to select good workers, whereas the electrical items 
alone, or a longer test of such items, would predict adequately. 
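
The dilution argument can be made concrete with a small calculation. The sketch below assumes, for simplicity, that a total score is the sum of a relevant part (for example, the electrical items) and an irrelevant part that is uncorrelated both with the relevant part and with the criterion. Under that assumption the validity of the total is the validity of the relevant part multiplied by the ratio of the relevant part's standard deviation to the standard deviation of the total. The function name and the figures are hypothetical; real item groups are usually intercorrelated, so the actual loss would be smaller than this idealized case suggests.

```python
import math


def composite_validity(relevant_validity, relevant_sd, irrelevant_sd):
    """Validity of (relevant part + irrelevant part) against the criterion.

    Assumes the irrelevant part is uncorrelated with the relevant part and
    with the criterion, so it adds variance to the total but no covariance.
    """
    total_sd = math.sqrt(relevant_sd ** 2 + irrelevant_sd ** 2)
    return relevant_validity * relevant_sd / total_sd


if __name__ == "__main__":
    # Hypothetical case: relevant items correlate .50 with the criterion.
    for irrelevant_sd in (0.0, 1.0, 2.0):
        r = composite_validity(0.50, relevant_sd=1.0, irrelevant_sd=irrelevant_sd)
        print(f"irrelevant-part SD {irrelevant_sd:.1f} -> validity of total {r:.2f}")
```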

Unless there is a close psychological correspondence between an available
test and the job, a new test of the ability must be constructed. Even if the
test items are appropriate, the structure of the test may make it unsuitable. A 
timed test, of excellent items, might predict badly on a job in which speed 
was insignificant. The solution is to use the test but to discard the author’s 
time limits. If a job calls for bimanual speed, perhaps packing small objects 
with both hands, it would probably be sounder to restandardize the Minne- 
sota Manipulation test for two hands than to use the standard (one hand) 
directions. 

There is no simple answer to the question of published versus homemade
tests. Tests with a wide range of usefulness are of distinct advantage in edu-
cational guidance and to a lesser extent in employment work. Counselors
would prefer to predict all jobs with a few tests. But the test which fits a spe-
cific job usually has significantly higher validity than the test for general use.

4. The Army and Navy tended to employ classification batteries made up of 
tests of general aptitudes likely to be found in several jobs, while the Army 
Air Force screened its candidates through tests many of which were related 
only to one particular aircrew job. Explain this difference in terms of the clas- 
sification problems of each service. 
5. What practical conditions would a department store consider in deciding
whether to make a special test for each type of clerk or to use a published 
test for salespeople in general? 

Experimental Trial 

The crucial step in prediction research is experimental trial of the instru- 
ments chosen after job analysis. One must give the tests to typical applicants 
and observe the correspondence of test scores to success. In practical work 
there is unfortunately much pressure to do without experimental study. 
When the psychologist reports to his boss that he believes test D will elim- 
inate poorer employees, the boss is far more anxious to install the test and 
benefit from it at once than to withhold judgment during weeks or months of 
study. Full experimental trial is indispensable. No hypothesis can be trusted 
until tested, because there have been many instances in which “likely” tests 
proved to be of no value in selection. The non-psychologist may propose to 
use the test to eliminate poor men, and to conduct research on the survivors
to determine the relation between test and performance. A good test, how-
ever, might not predict which of the acceptable men would do well on the
job, even though it could weed out the failures. (Example: An audiometric
test would rule out some people as music students; but within a selected
group, all of whom could hear, it would not predict success.) It will become
clear later, too, that only trial on an unselected group can accurately estab-
lish critical scores, weightings of tests, and correlations. So important is ex-
perimental trial on an unselected population that the AAF went to the trou-
ble to validate its selection methods by sending 1300 men, a random sample
of all eligible soldiers, through training for flying, even though it knew in
advance that the majority of these men would be failures (8, pp. 181-258).
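
The effect of studying only the survivors can be shown with a small simulation. The sketch below generates hypothetical applicants whose test scores and job success are correlated, then compares the correlation in the whole group with the correlation among those in the top quarter on the test. Everything in it (the function names, the normal distributions, the population correlation of .50, the top-quarter rule) is an assumption chosen for illustration; only the group size of 1300 echoes the AAF study, and no AAF data are used.

```python
import math
import random


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_applicants(n, true_r, seed=0):
    """Hypothetical (test score, job success) pairs with population correlation true_r."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        test = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        success = true_r * test + math.sqrt(1 - true_r ** 2) * noise
        pairs.append((test, success))
    return pairs


def validity(pairs):
    """Correlation between test score and success within a group."""
    tests, successes = zip(*pairs)
    return pearson_r(tests, successes)


def top_fraction_by_test(pairs, fraction):
    """Keep only applicants whose test scores fall in the top `fraction`."""
    ranked = sorted(pairs, key=lambda pair: pair[0], reverse=True)
    keep = max(2, int(len(ranked) * fraction))
    return ranked[:keep]


if __name__ == "__main__":
    applicants = simulate_applicants(n=1300, true_r=0.5)
    survivors = top_fraction_by_test(applicants, 0.25)
    print("unselected group:", round(validity(applicants), 2))
    print("top quarter only:", round(validity(survivors), 2))
```

In runs of this kind the coefficient among the survivors is typically well below the coefficient in the unselected group, although the test is equally good in both cases; the narrowed range of scores is responsible.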

The tests should be given with the same motivation that would exist in 
their practical use. One will try more tests than he can use in his final predic-
tion battery, since some will probably not be helpful and will have to be
eliminated. This makes the trial battery long in many cases, so that special
care must be given to maintaining cooperation from the subjects. Sometimes
one test can be studied at a time, but sooner or later the entire selection bat-
tery must be validated on a single group.

The Criterion 

After having given his tests, the experimenter waits for the emergence of 
evidence of good and poor job performance. The experimental group is 
treated in the same way as other workers, being given normal training and 
duties. After a suitable interval, data on success are obtained. Among the
criteria often used are quantity of production, quality of production, turn-
over, and opinions of foremen or supervisors. As was explained on p. 56, it
is important that the criterion possess a high degree of validity. A test which
can predict quality of work will seem to be a poor test, if it is judged by a
criterion which does not fairly indicate quality of work. The criterion must
cover all important aspects of the job.

Criteria may be based on output, field observations, or ratings. The cri-
terion must have high reliability. An adequate number of observations, rep-
resentative of normal performance, is required. Achievement tests may be
the criterion, in which case they must meet the usual requirements of ob-
jectivity, stability, and validity. Ratings are particularly common criteria and
are subject to many errors. Methods of making ratings more dependable are 
discussed in a later section (pp. 397 ff.). 
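
Why criterion reliability matters can be seen from the classical attenuation formula, under which the observed correlation equals the correlation between true scores multiplied by the square root of the product of the two reliabilities. The sketch below applies that formula; it is a standard result of classical test theory brought in here for illustration, not a procedure described in this section, and the function name and figures are hypothetical.

```python
import math


def observed_validity(true_validity, test_reliability, criterion_reliability):
    """Expected observed test-criterion correlation under classical attenuation."""
    return true_validity * math.sqrt(test_reliability * criterion_reliability)


if __name__ == "__main__":
    # Hypothetical test: reliability .90, true-score validity .60,
    # judged against criteria of decreasing reliability.
    for criterion_reliability in (0.90, 0.60, 0.30):
        r = observed_validity(0.60, 0.90, criterion_reliability)
        print(f"criterion reliability {criterion_reliability:.2f} -> observed r = {r:.2f}")
```

On these assumptions a test of unchanged quality looks steadily worse as the criterion becomes less dependable, which is the sense in which a good test may seem to be a poor one.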

It is frequently impossible to obtain a single perfect measure of success. 
One reason for this is that different workers seem best at different times. 
Sometimes the fastest learner, who does well at the end of a short training 
course, does not give as good final results as a learner who continues to gain
in ability after he starts work. Sometimes the worker who possesses good 
learning ability, and so makes good grades in training, lacks temperamental 
qualities for ultimate success on the job. Sometimes the requirements for 
success are not the same for all workers in the same field. Teachers, for ex-
ample, may be successful in different ways: One may develop into a friend
and counselor for youth; one may stimulate independent and courageous 
thinking in the few brightest pupils; another may have excellent results in 
overcoming the blockings that cause failure among poor students. No one of 
these teachers is best, but all are necessary types. It is impossible to find a
single criterion that is adequate for comparing these different types of teach- 
ing success. 

More and more attention is being given to establishing many criteria for 
success in the same job. This is particularly important for high-level jobs; 
there are a great many patterns of success among officers, executives, consult-
ing engineers, or artists. If it is possible to predict the particular strengths
and weaknesses of an applicant, the employer has an excellent basis for
placing him in the particular responsibility where he will be successful. This 
is exemplified in a study, by the writer, of the success of students seeking 
Doctor’s degrees in education. In this already highly selected group, pre-
liminary results show no way of predicting success or failure in doctorate re-
search in general. But appropriate methods do identify specific character-
istics, which knowledge may be used to help the student set up a research
problem on which his aptitudes and personality assets will be helpful. One 
of the few completed studies using more than one independent criterion is 
that of Otis, reported on p. 223. 

6. List several independent (non-duplicating) criteria which might be used to
evaluate teacher success.

7. List several independent criteria to consider in judging branch managers of
an equipment firm. Branches are responsible for both sales and service.
A striking study by Lennon and Baxter inverts the normal procedure and
determines what aspects of the criterion can be predicted by available tests
(13). Each item on a 90-item check list applied to clerical workers was cor-
related against a revision of Army Alpha and an aptitude test requiring
alphabetizing, number checking, coding, digit counting, computation, and 
reading of tables. For some aspects of job performance, correlations with the 
predictors were high, but other qualities were not predicted. Check-list items 
dealing with understanding of the work, quantity and speed of work, per- 
formance of multiple tasks, and avoidance of duplication of effort were well 
predicted. Items dealing with errors in work, typing, shorthand, grammar, 
orderliness, and “personality” were not predicted successfully. Some of the 
results are shown in Table 46. This study shows why it is difficult to predict 
such a composite criterion as “supervisor’s rating of all-round performance.”

8. What procedure might be suggested for selecting clerical workers, in view of 
Lennon and Baxter’s findings? 


Table 46. Percentage of Office Workers Having High and Low Aptitude Scores
Rated as Having Particular Characteristics (13)

                                        Learning Ability Test  Clerical Aptitude Test
                                        High 27%    Low 27%    High 27%    Low 27%
Characteristic                          (N = 58)    (N = 58)   (N = 58)    (N = 58)

His working instructions have to be
  repeated frequently                       7         12a          5           5
Has made helpful suggestions about
  work handled                             31         29          38          22a
Often does necessary but unrequested
  work on his own initiative               37         26a         39          25a
Checks his work for errors before
  releasing it                             51         45          52          48
Sometimes forgets matters which should
  receive prompt attention                  5          7           7           7
Is inclined to sacrifice accuracy for
  speed                                     4          3           6           5

a Difference between low and high group is probably a true difference, rather
than the result of chance in sampling.
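
The kind of tabulation behind Table 46 can be sketched as follows: workers are ranked on an aptitude score, the highest and lowest 27 percent are set apart, and the percentage of each group marked on a check-list item is computed. The record layout, field names, and sample data below are hypothetical and are not Lennon and Baxter's; the sketch also stops short of any test of whether a difference between groups exceeds chance.

```python
def extreme_groups(records, score_key, fraction=0.27):
    """Split records into the highest and lowest `fraction` on an aptitude score."""
    ranked = sorted(records, key=lambda record: record[score_key])
    size = max(1, round(len(ranked) * fraction))
    return ranked[-size:], ranked[:size]  # (high group, low group)


def percent_checked(group, item):
    """Percentage of workers in the group for whom a check-list item was marked."""
    marked = sum(1 for record in group if record[item])
    return 100.0 * marked / len(group)


def compare_item(records, score_key, item, fraction=0.27):
    """Return (percent marked in high group, percent marked in low group)."""
    high, low = extreme_groups(records, score_key, fraction)
    return percent_checked(high, item), percent_checked(low, item)


if __name__ == "__main__":
    # Hypothetical office staff of 215, which yields extreme groups of 58 each.
    staff = [
        {"aptitude": score, "takes_initiative": score % 4 == 0 or score >= 180}
        for score in range(215)
    ]
    high_pct, low_pct = compare_item(staff, "aptitude", "takes_initiative")
    print(f"high group: {high_pct:.0f}%   low group: {low_pct:.0f}%")
```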


When records have been collected to show which workers are most suc- 
cessful, the final procedure in a selection study is to process the data and de- 
termine which tests were the best predictors. Before discussing the detailed 
processes of analyzing prediction data, it will be best to see the entire re-
search process in perspective by examining an actual study.

Development of an Aptitude Test

One use of the procedures just outlined is to develop prognostic tests. In 
earlier chapters emphasis was placed on tests likely to have value in making 
a wide range of predictions. Because there are many jobs which these meas- 
ures of basic abilities do not predict, it is also necessary to design special tests 
for specific jobs. The development and characteristics of prognostic tests are 
well illustrated by the E.R.C. Stenographic Aptitude Test. 

Prediction of success in learning shorthand is of importance because many
girls undertake the study and quite a few fail. A test which could be given
before training would save time and money spent in stenography courses,
and permit girls to select a more appropriate vocation. Moreover, available
tests of general intelligence and other abilities have never been effective in
predicting failures in shorthand. Deemer therefore decided to develop a new
test, geared specifically to the problem of predicting success in learning
stenography. The abilities to be built into the test would be those important
in stenography, even if they were of no significance anywhere else in business
or education.

A job analysis was made. The analysis was based, in this case, upon a 
study of the shorthand systems and the nature of the job, rather than upon 
observation of stenographers. The resulting list of abilities was as follows (7):

“During dictation: The more efficient stenographer will probably be superior to
the less efficient in:

“1. Ability to listen to what is being said during dictation, i.e., facility with which
she attaches meaning to each word dictated.

“2. Ability to write correct outlines fluently and rapidly.

“3. Ability to hold a number of words in mind while writing others.

“4. Ability to be ‘behind the dictator’ without becoming flurried.

“5. Knowledge of symbols for complete words. The less efficient stenographer
will have to compose more outlines sound by sound during dictation.

“6. Thoroughness in checking, during pauses in dictation, the outlines just written.

“During transcription: The more efficient stenographer will probably be superior to
the less efficient in:

“1. Ability to judge from the length of her notes where to begin the letter on
the page.

“2. Ability to produce letters which are neat and clean.

“3. Ability to read the outlines she has written.

a. To call up the word or words for which an outline stands, either by recognizing
the outline as a whole or by deciphering the outline sound by sound.

b. To choose, when necessary, the word that fits the context.

“4. Ability to spell the words.

“5. Ability to type the words accurately and rapidly.

“6. Ability to judge how far ahead to read before beginning to type.”

This list of abilities was reviewed in the light of previous studies of shorthand
aptitude. It was reduced by eliminating aptitudes which all girls might
be expected to possess to an adequate degree, by eliminating those abilities
which would be learned in training, and by combining some abilities. The
preliminary forms of the test were then designed, using the following
exercises:

From W. L. Deemer, Jr., E.R.C. Stenographic Aptitude Test, Manual, p. 1.
Reproduced by permission of Science Research Associates. Copyright 1944 by
Walter L. Deemer.

I. Speed of writing. Girls were required to copy the Gettysburg Address in
longhand as rapidly as possible. This may be considered a complex coordination
test, duplicating many motor elements of shorthand writing.

II. Word discrimination. In this test, girls choose which of two words best fits
a particular context, as in “We are satisfied that our (personal personnel) is
completely loyal to the firm.” This simulates the problem in shorthand of
choosing the correct word when an outline fits more than one word. This is a
complex intellectual action involving verbal intelligence, vocabulary, and
spelling.

III. Phonetic spelling. Girls are to write, correctly spelled, the words represented
phonetically by oshen, akshn, vejtahl, bleef, etc. This simulates the problem
in shorthand of recalling the entire word from a phonetic symbol.

IV. Vocabulary. The subject is required to choose the best synonym for a word
presented in a sentence, e.g., “His associates [enemies, employees, colleagues,
pets, relatives-by-marriage] fear him.”

V. Sentence dictation. In this test, the tester reads aloud sentences of varying
length which the subjects take down. The sentences increase in length, so
that the subject must eventually carry many words in mind. The following
sentence, for example, is read in 24 seconds, and 48 seconds more are allowed
for girls to finish writing it: “Because of your error in filling our last order we
were unable to meet the demand of our customers for holiday flowers and thus
lost money and good will.”

The next stage in the study was to try the preliminary forms of the test. 
Some items proved ambiguous, too easy, or too hard and were removed. 
Thus, in Test V, sentences of 15 words or less were found to be so easy as 
to have no discriminating power and were dropped. The final form of the test 
for validation was then prepared. 

Validity was determined by administering the test to 500 students entering
shorthand classes. During the next two years, various measures of achievement
were collected. Table 47 reports validity coefficients obtained for the
total test score. These validity coefficients are high enough to justify using
the test at least to identify girls likely to have difficulty in the course. Since a
coefficient of .65 means that many false predictions will be made in individual
cases, a school may prefer to use the test to point out those who should have
special attention from the teacher rather than arbitrarily to bar girls with
low scores from trying shorthand.

Table 47. Correlation of Scores on E.R.C. Stenographic Aptitude Test 
with Various Criteria, for 500 Shorthand Students (7) 


Criterion Correlation 


Accuracy of transcription after one year of study: dictation at 60 w.p.m. or less .54 

Accuracy of transcription after two years of study: dictation at 80 w.p.m. or less .65 

Accuracy of transcription after two years of study: dictation at more than 80 w.p.m. .70 

Accuracy of transcription after two years of study: dictation of material, the
shorthand outlines for which had not been studied (80 w.p.m.) .58

Accuracy of transcription after two years of study, shorthand notes being
transcribed two weeks after dictation (90 w.p.m. or less) .65

Rate of transcription at end of two years of study .35 

9. What abilities listed in the job analysis are not represented in the final test?

10. What forecasting efficiency corresponds to a validity coefficient of .70?

11. Says Deemer’s manual, “No reliability coefficients are reported for this test
because it is felt that they add nothing to the reported validity coefficients. If
the validity coefficient is satisfactory, the reliability coefficient must be
satisfactory.” Although the latter sentence is defensible, one sometimes wishes
reliability data also. What important questions about Deemer’s test would be
answered by reliability coefficients for subtests and total score?

12. What explanations can be offered for the failure of the validity coefficient to
reach 1.00? What does this imply regarding ways to improve the prediction in
further studies?

Raw data: number of trainees

Percent of Work Correct       Men Later      Men Later
on 25th Lesson (3rd Day)      Qualified      Failed
                              (N = 216)      (N = 126)

91-100                           144             31
81-90                             31             16
71-80                             23             16
61-70                              4             17
51-60                              7             20
41-50                              4             16
31-40                              3              5
21-30                              0              5
0-20                               0              0

Bi-serial correlation, test and criterion: .64

Effect of cutting score to eliminate poorest 81 men: [bar chart showing successful
and unsuccessful men among those above 70 and those below 71]

Saving in man-days resulting from elimination after 25th trial:

                     Men trained   Men trained   Men         Man-days per        Man-days wasted
                     for 3 days    for 45 days   qualified   finished operator   through failure

Without screening        342           342          216             76                6048
With screening           342           261          198             65                3267

Fig. 59. Three methods of reporting the effectiveness of a proposed selection
plan for Morse code receivers (data from 11). Men were given repeated lessons
in code-voice training. It is proposed that men having difficulty on the
twenty-fifth trial, on the third day, be dropped from the program, which
requires a total of 48 days.
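The man-day figures reported with Fig. 59 can be checked from the raw counts. The sketch below is illustrative only: the function and variable names are invented here, and it assumes that every man is trained for 3 days before the screening point and that men who continue are trained for 45 more days, as the panel headings indicate.

```python
# Counts read from the raw-data panel of Fig. 59: for each score interval on the
# 25th lesson, (men who later qualified, men who later failed).
RAW = {
    "91-100": (144, 31),
    "81-90": (31, 16),
    "71-80": (23, 16),
    "61-70": (4, 17),
    "51-60": (7, 20),
    "41-50": (4, 16),
    "31-40": (3, 5),
    "21-30": (0, 5),
    "0-20": (0, 0),
}
SCREEN_DAYS = 3       # days of training up to the screening point
REMAINING_DAYS = 45   # days of training after the screening point


def lower_bound(interval):
    """Lowest score in an interval label such as '71-80'."""
    return int(interval.split("-")[0])


def summarize(cutoff=None):
    """Return (men continuing, men qualified, man-days per operator, man-days wasted).

    With cutoff=None nobody is dropped; otherwise men in intervals whose lower
    bound is below `cutoff` are dropped after the screening point.
    """
    total = sum(q + f for q, f in RAW.values())
    kept = [(q, f) for label, (q, f) in RAW.items()
            if cutoff is None or lower_bound(label) >= cutoff]
    kept_qualified = sum(q for q, _ in kept)
    kept_failed = sum(f for _, f in kept)
    continuing = kept_qualified + kept_failed
    dropped = total - continuing
    man_days = total * SCREEN_DAYS + continuing * REMAINING_DAYS
    # Wasted effort: 3 days for every man dropped, the full course for every
    # man who continued and still failed.
    wasted = dropped * SCREEN_DAYS + kept_failed * (SCREEN_DAYS + REMAINING_DAYS)
    return continuing, kept_qualified, man_days / kept_qualified, wasted


print("without screening:", summarize())
print("dropping men below 71:", summarize(71))
```

Under these assumptions the function gives 76 man-days per finished operator and 6048 wasted man-days without screening, and 64.5 man-days (reported as 65 in the figure) and 3267 wasted man-days with the proposed cut.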

TREATMENT OF DATA IN PREDICTION STUDIES 

There are many methods of organizing data from prediction studies. The
reader who surveys the literature will be impressed with the variety of reports
used, but dismayed by the difficulty of comparing results from studies
presented in different ways. Often an investigator deliberately chooses the
form of report which, while faithful, places his selection battery in the most
favorable light. The reader must be able to compare different methods of
reporting similar data, to avoid being misled by dramatic rephrasing of the
same facts. The widely used methods of reporting prediction studies are as
follows:

1. Correlation. A test having a high correlation with a criterion is a useful
predictor; one with low correlation usually is not.

2. Screening effectiveness. Such a report indicates the number of successful
employees earning good and poor scores on the selection test, compared
to the number of unsuccessful employees earning similar scores.

3. Benefits. In this method, attention is focused on the net gain, in dollars
and cents, from employing only selected men. This method usually compares
the turnover cost of failing employees and superior employees, or
the labor cost per unit produced by each group. Where man-hours are
precious, savings may be reported in those terms.

All of these methods are legitimate, and in business the third may be the
only relevant indication of the merit of a selection plan. On the whole, however,
correlation is a preferable method for reporting validation research. It
has the advantage of permitting comparison of the results of different studies,
and of giving a figure which can be placed on a scale from zero to perfect
forecasting efficiency (p. 39). The second method has the advantage of easy
interpretation and is more suitable than simple correlation wherever there
is a curvilinear relation between test and criterion. Special problems in this
method of analysis are discussed in Chapter 14.
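The scale from zero to perfect forecasting efficiency can be sketched numerically. The block below assumes that "forecasting efficiency" means the conventional index E = 1 - sqrt(1 - r^2); page 39 is not reproduced in this section, so that reading is an assumption, and the function name is illustrative.

```python
import math


def forecasting_efficiency(r):
    """Index of forecasting efficiency, E = 1 - sqrt(1 - r**2).

    E is 0 when the test gives no improvement over guessing the criterion
    mean and 1 when prediction is perfect.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return 1.0 - math.sqrt(1.0 - r * r)


# Validity coefficients of the size discussed in this chapter.
for r in (0.30, 0.50, 0.65, 0.90):
    print(f"r = {r:.2f}  E = {forecasting_efficiency(r):.3f}")
```

On this index the efficiency rises slowly at first and steeply only as r approaches 1, which is one reason coefficients from different studies are easier to compare than percentage or dollar reports.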

Multiple Correlation 

Where a single criterion is to be predicted, one seeks the tests having
the highest correlation with the criterion. Better prediction is obtained by
basing the prediction on several tests, since the job usually calls for many
abilities. Multiple correlation is a statistical technique for determining how
to combine tests to get the best possible prediction and to estimate what
correlation the composite score will have with the criterion. Formulas for
multiple correlation are given in most statistics texts.


Table 48. Effect on Multiple Correlation of Adding Tests
to Battery (17, p. 83)

                              Correlation with          Multiple Correlation of
                              Criterion (Shop           Criterion with First Test,
                              Performance of            First Two Tests, First
Tests                         Junior-High-School Boys)  Three Tests, Etc.

Paper Form Board                     .43                       .43
Stenquist Assembly                   .26                       .44
Steadiness                           .29                       .53
Card sorting                         .27                       .563
Tapping                              .18                       .580
Link's Spatial Relations             .36                       .594
Packing blocks                       .28                       .5953

To obtain a high multiple correlation, tests are sought which have a positive
correlation with the criterion and low correlations with each other. There
is value in combining tests of the same ability; this is merely equivalent
to making the original test more reliable. But if a new test measures a
component of the job-criterion not estimated by the first test, it will improve the
multiple correlation appreciably. The example in Table 48 shows not only
that prediction improves when we combine several tests with low validity, but
also that the multiple correlation reaches a ceiling very rapidly, so that
adding tests beyond the first three or four is rarely valuable. More elaborate
prediction batteries are worth while only when each added test measures a
new factor.
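The effect of intercorrelation can be sketched with the standard two-predictor formula, R^2 = (r1^2 + r2^2 - 2 r1 r2 r12) / (1 - r12^2). The validities .43 and .29 are taken from Table 48; the correlations between the two tests are hypothetical values chosen for illustration, since the source does not report them.

```python
import math


def multiple_r(r1, r2, r12):
    """Multiple correlation of a criterion with two tests.

    r1, r2 -- validity of each test (correlation with the criterion)
    r12    -- correlation between the two tests
    """
    if abs(r12) >= 1.0:
        raise ValueError("the two tests must not be perfectly correlated")
    r_squared = (r1 ** 2 + r2 ** 2 - 2.0 * r1 * r2 * r12) / (1.0 - r12 ** 2)
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("the three correlations are not mutually consistent")
    return math.sqrt(r_squared)


# Validities from Table 48; intercorrelations are hypothetical.
for r12 in (0.0, 0.3, 0.6):
    print(f"r12 = {r12:.1f}  R = {multiple_r(0.43, 0.29, r12):.3f}")
```

When the second test is unrelated to the first, the multiple correlation rises well above .43; when the two tests overlap heavily, it barely moves, which is the pattern the paragraph describes.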

One can often afford to discard tests from a trial selection battery even
though they have positive correlations with the criterion. A study with the
MacQuarrie test, which has seven subtests, illustrates this point. The test was
used to predict ratings of radio assemblers in training. The best correlation
obtainable by weighting all seven subtests was .46. By combining the four
best subtests (Tracing, Location, Copying, and Blocks) a multiple correlation
of [...] was obtained. The four best tests are as useful in this prediction as
the entire set requiring about twice as much time (10).

A third illustration may be given to show how little value there is in
extending a battery by adding even reliable tests, if they duplicate abilities
already measured. The following correlations between tests and elimination
from flying training were found by the AAF (8, p. 194):

Pilot stanine (i.e., composite score on selection battery)     .653
Stanine plus Qualifying examination                             .655
Stanine plus Qualifying plus General Classification Test       .655

Critical-Score Method 

In contrast to the prediction formula by means of which one sets up a
composite score on all tests, the critical-score technique is much simpler. A
critical score, or cutting score, is the score at which the number of probable
failures becomes so high that persons below that level are poor risks. To use
more than one predictor, a multiple screen is set up, so that a person falling
below the critical score on any test is eliminated.

Establishment of a critical score is illustrated by the data in Figure 60. The
investigator here obtained IQ’s from the Terman Group or Otis Self-Administering
mental tests, which are nearly equivalent, for 1146 pupils in one
high school. Over a period of years, he accumulated records of their passing
and failing, as plotted in Figure 60. From these data, one may advise an
entering pupil whether he has a good chance of passing the course. If one
chooses to screen out pupils who have one chance in three of failing, IQ 92
will be used as a cutting score for admission to algebra, and IQ 80 for
admission to English.
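The same procedure can be sketched with counts that are available in this chapter. The IQ data behind Figure 60 are not reproduced here, so the block uses the Morse code counts from Fig. 59 instead; the function names and the rule "highest acceptable risk of failure" are illustrative assumptions.

```python
# (lower bound of score interval, men later qualified, men later failed),
# from the raw-data panel of Fig. 59. The empty 0-20 interval is omitted
# because no failure rate can be computed for it.
INTERVALS = [
    (91, 144, 31), (81, 31, 16), (71, 23, 16), (61, 4, 17),
    (51, 7, 20), (41, 4, 16), (31, 3, 5), (21, 0, 5),
]


def failure_rates(intervals):
    """Proportion of men failing within each score interval."""
    return {low: failed / (qualified + failed)
            for low, qualified, failed in intervals}


def critical_score(intervals, max_risk):
    """Lowest interval bound such that this interval and every higher one
    has a failure rate no greater than max_risk; None if none qualifies."""
    cut = None
    for low, qualified, failed in sorted(intervals, reverse=True):
        if failed / (qualified + failed) > max_risk:
            break
        cut = low
    return cut


for low, rate in failure_rates(INTERVALS).items():
    print(f"scores from {low}: {rate:.0%} failed")
print("cut for an even chance of passing:", critical_score(INTERVALS, 0.5))
print("cut for at most one chance in three of failing:", critical_score(INTERVALS, 1 / 3))
```

A stricter tolerance for failure pushes the cutting score upward, just as the choice of "one chance in three" fixes the IQ cutoffs quoted for algebra and English.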

Fig. 60. Relation between IQ and probability of failure in
English and algebra (14).

The peculiar effectiveness of the critical-score method is shown by Tiffin’s
data on paper-machine operators (23). The Bennett test was only a fair
predictor, judged by the correlation of .47 with foreman’s rating. But the data
shown in Figure 61 indicate that the test is quite effective in one way: Anyone
who is very bad on the test is very bad on the job. Evidently, mechanical
comprehension is a necessary but not sufficient condition for being a good
operator. A cutting score of 30 will eliminate five very poor men, at no cost in
good men. Any higher score begins to take a toll of both good and poor men.
(In practice, more cases are required to establish a cutting score than are
available here.)

13. What cutting scores would be used if we wished to admit pupils to classes 
when they have an even chance of passing (Fig. 60)? 

14. What assumptions are made if cutting scores based on Figure 60 are applied 
in other schools? 

15. What effect would raising the cutting score for code learning to 90 have on 
the screening effectiveness of the proposed elimination plan (Fig. 59)? How 
many man-days would be saved? 

16. Under what circumstances is it practicable to use a severe cutting score rather 
than a lenient one? 

17. Determine the median rating of men falling in each interval, 0-30, 31-40,
41-50, 51 and over, from the data in Figure 61. What effect would a cutting
score of 50 have on the quality of employees hired?

Results of screening are often described in terms of “misses” and “false
positives.” The aim in screening is to detect probable failures. Since one is
looking for symptoms of failure, the medical terminology of “false positive”
applies when one labels a person as a probable failure though he would
not actually fail (see Fig. 62). If the screening system accepts a man who
later fails, that case was a “miss” for the system. The best selection system
reduces both false positives and misses, but a deep cut leads to many false
positives and few misses, while a shallow cut is likely to increase the number
of misses. Which type of cut is best depends upon the relative seriousness
of false positives and misses, and whether we can afford to discard many
applicants in processing.
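The trade-off between the two kinds of error can be counted directly. The sketch below reuses the Morse code counts from Fig. 59 as an example; the function name and the three cutting scores compared are illustrative choices, not part of the original study.

```python
# (lower bound of score interval, men later qualified, men later failed),
# from the raw-data panel of Fig. 59.
INTERVALS = [
    (91, 144, 31), (81, 31, 16), (71, 23, 16), (61, 4, 17),
    (51, 7, 20), (41, 4, 16), (31, 3, 5), (21, 0, 5), (0, 0, 0),
]


def screen_outcomes(intervals, cut):
    """Count errors when every man scoring below `cut` is rejected.

    false positives -- rejected men who would have qualified
    misses          -- accepted men who later failed
    """
    false_positives = sum(q for low, q, f in intervals if low < cut)
    misses = sum(f for low, q, f in intervals if low >= cut)
    return false_positives, misses


for cut in (31, 71, 91):   # shallow, intermediate, and deep cuts
    fp, miss = screen_outcomes(INTERVALS, cut)
    print(f"cut at {cut}: {fp} false positives, {miss} misses")
```

Moving the cut from 31 to 91 trades misses for false positives, so the preferred cut depends on which error is more costly in the situation at hand.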

18. Which would be preferable, false positives or misses, in each of these situa- 
tions? 

a. Candidates for admission to teacher training are screened for ability. 

b. It is important to hire skilled sheet-metal workers to fill vacancies, during a
time of tight labor supply. Men cannot be trained on the job.

c. In inducting soldiers, screening is used to determine which should be
checked individually by psychiatrists.

d. A company wishes to hire mechanics and put them through an expensive 
training program; success cannot be observed until the end of the course. 

In guidance the above plans for screening are modified. Since factors not
measured in tests must be considered to do full justice to the results,
counselors prefer to use tests as a basis for discussion and judgment, rather
than to employ arbitrary cutoffs. Many personnel workers selecting or placing
workers use tests as a source of information rather than adopting cutting
formulas; selection is based on a final subjective judgment. Some doubt is cast
on the wisdom of such a selection procedure, except where personnel workers
are extremely competent, by results found in Navy classification. Trained
enlisted classification specialists interviewed each man, having available his
test scores, a life history, and other data. The interviewer gave a final rating
as to the man’s probable success in training school. Men sent to electrician’s
mate school had varying success in the school. A combination of two tests
(Electrical Knowledge and Arithmetic Reasoning) correlated .50 with success in
training. Interviewer’s rating, based on tests plus judgment, correlated only
.41 with success. In other words, judgments departing from the test
recommendation reduced the correctness of prediction (6). This finding was
confirmed in Air Force research.

Fig. 62. Determination of effectiveness of a critical score.

Reproduced by permission of Prentice-Hall, Inc., from Industrial psychology, Second
Edition, by Joseph Tiffin. Copyright, 1942, 1947, by Prentice-Hall, Inc.

19. What factors outside the test must ordinarily be considered in deciding
which of two men with equal test scores should be hired for a civilian job
comparable to electrician’s mate trainee?

20. A high-school boy has equal scores in two test batteries, one supposed to
predict success in electrical work, and one to predict success in work with Diesel
engines. If the tests are equally valid, what non-test factors would be
considered in advising him which type of training to enter?

Illustrative Results 

When research on validation is completed, one has a selection program to
put into operation. Tables 49 and 50 illustrate the selection program installed 
by the Army Air Force, which used a multiple-correlation method, and the 


Table 49. Validity Data and Combining Weights Used in the AAF Classification
Program of November, 1943 (8, pp. 99, 101)

                                   Correlation with Criterionᵃ       Relative Weight
Test                               Bomb.     Nav.     Pilot        Bomb.     Nav.     Pilot

Printed tests:
  Reading comprehension             .12      .32       .19            8        2       --
  Spatial orientation II            .09      .33       .25           --       10        5
  Spatial orientation I             .12      .38       .20           --        9        6
  Dial and table reading            .19      .53       .19           14       18        4
  Biographical data, pilot          --       --        .32           --       --       15
  Biographical data, navigator      --       .23      -.03           --        9       --
  Mechanical principles             .08      .13       .32           --       --        8
  Technical vocabulary, pilot       .04      .10       .30           --       --       13
  Technical vocabulary, nav.        .04      .22       .09           --       --       --
  Mathematics                       .10      .50       .08           --       18       --
  Arithmetic reasoning              .12      .45       .09            8       12       --
  Instrument comprehension I        --       --        .15
  Instrument comprehension II       --       --        .35
  Numerical operations, front       .13      .26       .01           --                --
  Numerical operations, back        .11      .28       .02                    --       --
  Speed of identification           .09      .19       .18           --                --

Apparatus tests:
  Rotary pursuit                    .14      .10       .21           12                 4
  Complex coordination              .18      .24       .38           12       --       17
  Finger dexterity                  .16      .20       .11           19        6
  Discrimination reaction time      .22      .36       .22           27        6        4
  Two-hand coordination             .12      .26       .30           --       11        4
  Rudder control                    --       --        .42           --       --       12

ᵃ The criterion is graduation or elimination from training.

Table 50. Multiple Cutoff Plan for British Army Classificationᵃ (3)

                                              Bennett
                                              Mechanical
                                              Compre-    Arith-
“Family” of Duties                  Matrixᵇ   hension    metic    Verbal   Clerical   Agilityᶜ   Other Tests

1. Driving                            4         3-        Any      Any       4          4        Assemblyᵈ over 23
2. Maintenance                        3-        3+        3-       Any       Any        Any      Assembly over 40
3. Signaling                          3-        3-        3+       3+        3-         4        Morseᵉ over 50
4. Construction                       4         4         4        4         Any        4        --
5. Clerks, storekeepers               3-        4         3-       3+        3-         Any      --
6. Other mobile (combat)              Any       Any       Any      Any       Any        3-       --
7. Other domestic, administrative     Any       Any       Any      Any       Any        Any      --

ᵃ Each score indicates the lowest test grade usable in each classification. Grades range from a high of
1 to a low of 5.

ᵇ A non-verbal test of general ability.

ᶜ A test of general bodily coordination and speed.

ᵈ A test resembling the Stenquist Assembly test.

ᵉ A test of perception of dots and dashes, resembling the Seashore rhythm test.

program of the British Army, which used a multiple cutoff plan. The AAF
tried numerous tests, chosen on the basis of a job analysis. These were tested
experimentally on successive classes of air cadets, the battery being revised
as new and better tests were found. The same tests were used to predict
success as pilot, bombardier, and navigator, using a different combining formula
for each job. Table 49 indicates the weighting plan. In selecting bombardiers,
for example, discrimination reaction time and finger dexterity counted very
heavily; reading and arithmetic had only a small weight. Composite scores
could also be computed for such specialties as fighter pilot, bomber pilot,
mechanic-armorer-gunner, and radar observer. Figure 63 shows representative
validity data, gathered over a two-year period. The correlation of pilot
aptitude rating with graduation-elimination was .45, but this coefficient would
be higher if men with low composite scores had not been eliminated before
entering elementary training.



Fig. 63. Elimination of cadets from elementary pilot training in relation to
pilot aptitude score (8, p. 134). [Bar chart: per cent eliminated and per cent
graduated, on a scale from 0 to 100 per cent.]

The British plan, instead of combining scores into a single composite,
employed separate hurdles or cutting scores for each test in each division of
the service. For men who qualified for several classifications, assignment
was made on the basis of preference and the current needs of the various
branches.

21. What abilities counted more heavily in the bombardier-prediction score than 
in the other two predictions? 

22. How do you account for the different weights assigned the first two tests in
navigator prediction, in view of their similar validity coefficients?

23. Which of the three aircrew jobs has the smallest psychomotor component, ac- 
cording to the prediction weights? 

Comparison of Multiple-Screen and Multiple-Correlation Methods 

The essential difference between the two plans for organizing predictive
data is that one yields a composite score and one does not. Figure 64 outlines
graphically the difference in the assumptions made under the two plans. The
multiple-correlation method has these advantages:

1. It indicates the rank, in all-round ability, of men who pass the screen.
This is useful in identifying men requiring special assistance during training,
or for singling out superior men for special responsibility.

2. For a particular man, it permits a comparison of his probable success in
various specialties, instead of merely eliminating the assignments in which he
would fail.

3. It permits combining the tests in that proportion which gives the highest
correlation. Prediction is therefore more accurate than with a multiple cutoff.

4. It yields, in the multiple correlation, a simple estimate of the efficiency
of prediction from the test battery. The formula also indicates the
contribution of each test to the final prediction.

The multiple cutoff method has these advantages: 

1. It does not assume that strength in one ability compensates for in- 
adequacy in another important ability. 

2. It is easier to compute and easier for the layman to understand than a 
composite formula. It is usually easier to administer. 

3. Retaining the scores of separate tests in the record permits more ef- 
fective guidance or placement than an undifferentiated composite or average. 

The multiple cutoff plan is more compatible with recent thinking about
the value of multiple criteria. If a person can succeed on a job in different
ways, it need not be assumed that his special strengths and special weaknesses
cancel each other. In selecting potential symphony conductors for
training, most people would agree that intelligence and pitch discrimination
are requisites. But if tests were used, a battery combining these two predictive
scores would pass an exceptionally brilliant thinker who could not
tell when the musicians were in key.

Fig. 64. Comparison of multiple-screen and multiple-correlation methods.

Suppose an intelligence test and a coordination test both predict a certain job, as indicated
by the following data: [two scatter diagrams, Intelligence and Coordination, marking good men
and failures on the job]

Then suppose six new applicants are being considered, whose scores are as follows:
[drawing of the six applicants]

Multiple-screen method of selection

The scatter diagram shows that men with IQ 80 or 90 tend to fail on the job. Also, men with
coordination of 20 or 25 tend to fail. It is therefore decided to screen out all applicants with
IQ below 100, or coordination below 30.

This is how the six men are judged by this method: [drawing; legible labels are
IQ 100, Coord. 30 and IQ 110, Coord. 35]

Multiple-correlation method of selection

The statistical computations of this method set up a prediction formula for combining
the scores. In this problem the formula might be IQ + 4 times Coordination.

This formula is applied to each man, with these results: [drawing; legible combined
scores are 160, 220, 190, 250]

The men with the poorest composite scores are eliminated: [drawing; legible combined
scores of the men retained are 220 and 250]

When the fourth man is hired, it is assumed that his superior coordination makes up for his
lack of intelligence. On some jobs this assumption is unsound.
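The contrast drawn in Fig. 64 can be sketched in a few lines. The cutoffs (IQ 100, coordination 30) come from the figure, and the weight of 4 on coordination is the reading of the figure's formula that matches its legible composite scores (100 + 4 x 30 = 220, 110 + 4 x 35 = 250). The six applicants below are hypothetical, since the scores drawn in the original figure are not recoverable, and all names are illustrative.

```python
from typing import NamedTuple


class Applicant(NamedTuple):
    name: str
    iq: int
    coordination: int


# Hypothetical applicants, invented for this sketch.
APPLICANTS = [
    Applicant("A", 80, 20),
    Applicant("B", 100, 30),
    Applicant("C", 90, 25),
    Applicant("D", 85, 40),
    Applicant("E", 110, 35),
    Applicant("F", 120, 20),
]

IQ_CUTOFF = 100            # Fig. 64: screen out IQ below 100
COORDINATION_CUTOFF = 30   # Fig. 64: screen out coordination below 30
COORDINATION_WEIGHT = 4    # assumed reading: composite = IQ + 4 x coordination


def multiple_screen(applicants):
    """Multiple-screen method: accept only men who pass every cutoff."""
    return [a.name for a in applicants
            if a.iq >= IQ_CUTOFF and a.coordination >= COORDINATION_CUTOFF]


def composite(applicant):
    """Multiple-correlation method: one weighted composite score per man."""
    return applicant.iq + COORDINATION_WEIGHT * applicant.coordination


def top_by_composite(applicants, n_hired):
    """Accept the n_hired men with the highest composite scores."""
    ranked = sorted(applicants, key=composite, reverse=True)
    return [a.name for a in ranked[:n_hired]]


print("multiple screen accepts:", multiple_screen(APPLICANTS))
print("composite scores:", {a.name: composite(a) for a in APPLICANTS})
print("top three by composite:", top_by_composite(APPLICANTS, 3))
```

In this invented group the composite admits applicant D, whose high coordination offsets an IQ of 85, while the multiple screen rejects him; whether that compensation is sound depends on the job.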

At one time during World War II, the Navy desired to train men to op- 
erate anti-submarine listening gear. By the usual research procedure, psy- 
chologists established a prediction formula: Men were screened on general 
intelligence; and within the surviving group, predictions were based on the
average of the Bennett Mechanical Comprehension test and several Seashore 
tonal tests. Following standard Navy procedure, acceptable men were sent to 
training school, and those who failed in school were assigned to general sea 
duty as apprentice seamen. It was therefore a serious matter when a man of 
good intelligence was sent to a school for which he was unqualified, since his 
ability would not be used at a high level for the balance of the war. Many
men who failed in the school did so because of very poor tonal judgment, 
which made them unfitted for listening duty. How had they happened to be 
sent to sound training? Their high mechanical comprehension (many had 
studied college physics) raised their average enough to conceal, in the 
formula, their weakness. Such men, despite an adequate “average” ability, 
were doomed to fail in sound training when they might have been good in 
engineering, radar maintenance, or navigation. In fact, a few were salvaged 
by school officers and sent to other training, where they did well. At a later 
date, further research led to a multiple-screen selection procedure. 

Unless used wisely, the prediction-formula procedure can lure a tester into 
the ancient error of adding 3 pigs and 2 wheelbarrows, and taking the an- 
swer “5” seriously. 


PRINCIPLES OF PREDICTION 

Anyone selecting aptitude tests has to evaluate studies of validity reported 
in the literature. These studies follow the general plan outlined earlier in the 
chapter, but to interpret the results correctly one must be aware of several 
statistical principles. This section presents the most important of these. 

What Is an Acceptable Validity Coefficient? 

As validity coefficients for various tests have been presented in past chapters,
the reader has probably been mentally classifying them as “good” or
“poor.” Many tests, especially of special abilities, do not seem very
satisfactory at first glance. But, in one sense, a test has a satisfying validity
coefficient if it is better than other tests for the same purpose. A correlation of
.70 is far short of perfect, yet Deemer’s shorthand test is a substantial
achievement, simply because it yields a better prediction than previous tests for
stenography. Probably the only fair standard for an acceptable validity
coefficient is the question: Does the test permit us to make a better judgment
than we could make without it, sufficiently better to justify its cost?

Psychologists have now abandoned their insistence on validity coefficients
of .70 or .80 for all tests. While we would be pleased to reach these or better
levels, the experience of 30 years of practical testing shows that we very often
cannot attain such standards. It has been found repeatedly that coefficients
as low as .30 are of definite practical value (cf. Table 49). Occasionally, a
test with much lower validity is promising for further development if it
measures what no other test does.
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Tlw ey^uation of validity coefBcients for industrial and military testing is 
now made in terms of selection cost (1). A test which substantially increases 
thejiropprtion of good employees is a test worth using. But the validity of 
Ihe testynust be balanced against attrition, the number of potential good em- 
ployees_ discarded in screening. The point may be illustrated with reference 
to Figure 61. If paper-machine operators rated 6 or below are considered un- 
desirable, the Bennett test will sort out poor employees. Of the present em- 
ployees, 19 (40 percent) are poor. If we had refused to hire any man with a 
Bennett score below 31, only 15 men out of 43 ( 35 percent) would be poor. 
A cutting score of 40 would pass 27 men, of whom 5 ( 19 percent ) would be 
poor. A cutting score of 50 passes just 11 men, but only 9 percent of them 
are poor. With diSerent cutting scores, then, we can increase the percentage 
of satisfactory employees from 60 to 91. But the attrition is tremendous. With 
a cutting score of 50, it would be very difiBcult to find enough suitable em- 
pKyees, since we would have to process four times as many applicants to 
obtain the same number of operators. Whether tliis can be done depends 
7ipon~t he~TaB br~supply. A low cutting score on the Beiin'e'tt test would be 
'helplul'in times of tight labor supply, since it would drop out four failures at 
no cost in good men. In times when applicants are plentiful, the test could 
be used with a very high cutting score, since it is certainly desirable to up- 
grade the working force until 91 percent are satisfactory. 
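
The trade-off in the paper-machine example can be tabulated directly. The
sketch below is illustrative only: the counts at cutting scores 31 and 40 are
those quoted above, while the total of 47 present employees and the single poor
man above a score of 50 are inferred from the quoted percentages and are not
figures stated in the text.

```python
# Quality gained versus attrition at each Bennett cutting score.
# Counts at cutoffs 31 and 40 are quoted in the text. TOTAL_MEN = 47 and the
# one poor man above a score of 50 are inferred from the percentages
# (assumptions, not source figures).
TOTAL_MEN = 47
TOTAL_POOR = 19

# cutting score -> (men passing, poor men among those passing)
CUTOFFS = {
    None: (47, 19),  # no test: every present employee is retained
    31: (43, 15),
    40: (27, 5),
    50: (11, 1),
}


def summarize(passed, poor, total=TOTAL_MEN):
    """Return (percent poor among those passed, applicants to process per
    man hired, relative to hiring without the test)."""
    pct_poor = 100.0 * poor / passed
    processing_load = total / passed
    return pct_poor, processing_load


def good_men_lost(passed, poor, total=TOTAL_MEN, total_poor=TOTAL_POOR):
    """Number of satisfactory men screened out by the cutting score."""
    good_total = total - total_poor
    return good_total - (passed - poor)


if __name__ == "__main__":
    for cutoff, (passed, poor) in CUTOFFS.items():
        pct_poor, load = summarize(passed, poor)
        lost = good_men_lost(passed, poor)
        label = "none" if cutoff is None else str(cutoff)
        print(f"cutoff {label:>4}: {pct_poor:4.1f}% poor, "
              f"{load:.1f}x applicants, {lost} good men rejected")
```

Reading the rows side by side shows the two quantities that must be balanced:
the falling percentage of poor men and the rising number of applicants (and
rejected good men) that each higher cutting score demands.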

Two other factors are considered in addition to labor supply. One is the
possibility of placing rejected applicants in other jobs. This is especially
important in military processing, where nearly every man must be used. Since
every man is to be placed, there is nothing to lose if tests having very low but
positive validities are used in assigning men. The other factor to consider is
the cost of hiring unsuitable workers. If hiring and training cost is low, there
may be little value in screening employees even when a validity coefficient is
.70. A test with a validity of only .10 or .20 may be important if it saves
thousands of dollars otherwise wasted in training an inept flier or
rehabilitating a man who develops a predictable combat neurosis.

Taylor and Russell have developed tables showing the relations between
validity coefficient, selection ratio (percentage of applicants to be hired),
and improvement in percentage of satisfactory employees (22). Tiffin’s chart
based on these tables shows this relation graphically (Fig. 65). The chart
is based on a situation in which 50 percent of employees are satisfactory
before introduction of a proposed selection test. It is seen from the chart that
if we can afford to hire only 30 percent of those tested, a test with a validity
coefficient of .50 will cut the number of failures in half. Even a test with a
validity of .20 will eliminate eight failures out of every hundred men hired.
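
The relation tabulated by Taylor and Russell can be approximated numerically.
The sketch below assumes, as those tables do, that test score and criterion
follow a bivariate normal distribution. The function name, the midpoint-rule
integration, and the particular selection ratios printed are illustrative
choices and do not come from the source.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def proportion_satisfactory(validity, selection_ratio, base_rate=0.5,
                            steps=4000):
    """Share of selected applicants who prove satisfactory.

    Assumes a bivariate normal test score X and criterion Y with correlation
    `validity` (must be below 1 in absolute value). `selection_ratio` and
    `base_rate` must lie strictly between 0 and 1.
    """
    x_cut = _STD.inv_cdf(1.0 - selection_ratio)  # test cutting score, z units
    y_cut = _STD.inv_cdf(1.0 - base_rate)        # criterion cutoff, z units
    resid_sd = (1.0 - validity ** 2) ** 0.5
    upper = 8.0                                  # stands in for +infinity
    width = (upper - x_cut) / steps
    joint = 0.0
    # Midpoint-rule integral of P(Y > y_cut | X = x) * density(x), x > x_cut.
    for i in range(steps):
        x = x_cut + (i + 0.5) * width
        p_success = 1.0 - _STD.cdf((y_cut - validity * x) / resid_sd)
        joint += p_success * _STD.pdf(x) * width
    return joint / selection_ratio


if __name__ == "__main__":
    for validity in (0.20, 0.50):
        for ratio in (0.80, 0.30, 0.10):
            share = proportion_satisfactory(validity, ratio)
            print(f"validity {validity:.2f}, selection ratio {ratio:.2f}: "
                  f"{100 * share:.0f}% satisfactory")
```

For any positive validity the share of satisfactory men rises as the selection
ratio falls, which is why a strict selection ratio can partly compensate for a
modest validity coefficient.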

The following comment by Tiffin is a good summary of the point under
consideration (23, pp. 69-70):

“Psychologists dealing with vocational guidance and individual consultation are
usually more interested in making accurate individual predictions than in making
group predictions. Most of these psychologists have tended, therefore, to evaluate
a test almost entirely in terms of its validity coefficient. They have stressed the
fact (which is unquestionably true for individual prediction) that there is no
substitute for high validity: that if two tests have been validated against the
same criterion, and one has a higher validity coefficient than the other, there is
no way to make the one having the lower validity serve as well. The main point
of the above discussion is that in group testing, where one is interested in average
rather than individual results, one can make the test with the lower validity
perform as well as the other by sufficiently reducing the selection ratio. In other
words, in group testing, a reduction of the selection ratio is a substitute for high
validity. . . . A reduction in the selection ratio can be utilized whenever two or
more employees are being placed on two or more different jobs, and if tests of some
validity are available for each of the jobs.”*

Fig. 65. Effect of test validity and the selection ratio on the screening
efficiency of an employee selection test.

In evaluating a validity coefficient to decide whether a test is worth using,
one must ask the following questions:

1. Is the decision being made one of crucial importance to the individual, so
that an error in prediction will be very damaging to him?†

2. Can we afford to hire, and later discharge or transfer to other duties, men
who prove to be unsuccessful?

3. Is it important to hire every applicant who will be satisfactory, even
though this also involves hiring many men who will fail?

4. Does this test measure an ability which is already fairly well measured by
other tests or procedures already in use?

5. Is the validity coefficient much lower than the reliability coefficient of the
test? (If not, lengthening the test should raise validity.)

6. Is administration of the test difficult and costly?

* This quotation and Fig. 65 reproduced by permission of Prentice-Hall, Inc., from
Industrial psychology, Second Edition, by Joseph Tiffin. Copyright, 1942, 1947, by
Prentice-Hall, Inc.

† The questions are so worded that an answer of “no” indicates that tests of
relatively low validity are likely to be helpful.
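
The question about validity and reliability turns on how much a longer test
could help. The following is a minimal sketch, assuming the classical formula
for the validity of a test lengthened n-fold with parallel material (the
Spearman-Brown reasoning applied to validity). The coefficients passed in are
illustrative values, not figures from the text.

```python
from math import sqrt


def lengthened_validity(validity, reliability, n):
    """Expected validity after lengthening the test n-fold with parallel
    items, under classical test theory."""
    return validity * sqrt(n) / sqrt(1.0 + (n - 1.0) * reliability)


def ceiling_validity(validity, reliability):
    """Limit approached as the test is made indefinitely long, that is,
    as its reliability approaches 1.00."""
    return validity / sqrt(reliability)


if __name__ == "__main__":
    # Same observed validity, two different reliabilities (illustrative).
    for reliability in (0.50, 0.95):
        doubled = lengthened_validity(0.40, reliability, 2)
        limit = ceiling_validity(0.40, reliability)
        print(f"reliability {reliability:.2f}: doubled length -> "
              f"{doubled:.3f}, ceiling -> {limit:.3f}")
```

Under these assumptions an unreliable test has room to gain from added length,
whereas a test that is already highly reliable gains almost nothing.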

24. In the light of the foregoing questions, how satisfactory is Deemer’s validity 
coefficient of about .65? 

25. In one pilot-selection study, the predictive validity of pencil-and-paper tests 
was .64 (elimination-graduation criterion). The coefficient is raised to .69 
when apparatus tests are added (8, p. 193). Is such a small increase worth 
while, in view of the questions listed above? 

26. The USES General Clerical Battery is intended to guide workers into appro- 
priate clerical positions (20, p. 154). A very low selection ratio may be used, 
since a particular unemployed worker may be directed into any one of hun- 
dreds of job families. Assume a selection ratio of .10. 

a. What improvement in success of workers recommended will result if the 
test is used to select card-punch-machine operators? The validity coef- 
ficient is .39. 

b. What improvement will result if the test is used to select bookkeeping- 
machine operators? The validity coefficient is .12. 


Prediction as a Function of Criteria and Training 


The size of a validity coefficient is limited by the reliability of the
criterion. Therefore in many studies improvement in validity coefficients is
obtained by refining the criterion rather than by continued development of the
predictors. An example of the effects of better criteria is shown in Figure 66.
When Navy classes in ship’s engine operation were given no standard
achievement tests, school grades had rather small correlations with predictive
tests. But when two highly valid achievement tests were used in allotting
school grades, the classification tests were much better predictors. It is
to be noted that the earlier grades were influenced most by the abilities
measured in the Arithmetic test. When a valid measure of job knowledge
and skill was developed, the prediction relied most heavily on Mechanical
Knowledge and Mechanical Aptitude factors. The tests that predict a valid
criterion may be different from those predicting an invalid criterion.
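
The ceiling that criterion unreliability places on validity follows from the
classical attenuation relation. The sketch below rests on that assumption, and
the reliabilities used are illustrative and are not taken from the Navy study.

```python
from math import sqrt


def max_validity(test_reliability, criterion_reliability):
    """Largest validity coefficient obtainable when both test and criterion
    contain random error (classical test theory)."""
    return sqrt(test_reliability * criterion_reliability)


def validity_with_better_criterion(observed_validity, old_reliability,
                                   new_reliability):
    """Expected validity if only the criterion's reliability changes from
    old_reliability to new_reliability."""
    return observed_validity * sqrt(new_reliability / old_reliability)


if __name__ == "__main__":
    # Illustrative values: a reliable test judged against unreliable grades.
    print(f"ceiling: {max_validity(0.90, 0.40):.2f}")
    print(f"after refining criterion: "
          f"{validity_with_better_criterion(0.30, 0.40, 0.90):.2f}")
```

The second function treats the gain as due to reliability alone. A refined
criterion may also measure something different, as the shift toward the
mechanical tests in Figure 66 suggests, and this sketch does not model that.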

When a prediction study is made, it is assumed that all cases included in a
particular correlation have been given similar treatment in all stages of the
study. They must have the same training, the same incentive, and the same
type of grading. Otherwise, the correlation will be reduced even if the test
predicts well how each man responds to whatever experiences he receives.
Variation in grading from instructor to instructor has been one major source
of difficulty in predicting college grades. A study of prediction of success in
20 nursing schools gave multiple correlations of disappointing size, from .26
to .Sd. When computations were based on one school at a time, validities
jumped markedly (as high as .80 in school A; as high as .77 in school K). The
authors traced the low correlations in combined samples to variation in
grading policies of the schools (24).

Fig. 66. Correlations of Navy classification tests with grades in Basic
Engineering School, before and after introduction of standard achievement
tests (21, p. 307).


If there is variation in the training given different men, their final scores
will not be predicted accurately. Figure 67 illustrates this problem. The data
plotted are the scores of two classes of student torpedomen on a prediction 
test and an achievement test. Since actual scores of all men are not available, 
these scores are hypothetical data based on a study reported by the staff of 
the Bureau of Naval Personnel (21, p. 308). For the entire scattergram, the 
correlation is positive but low. It is apparent that Class 2 had poorer instruc- 
tion, since men of the same aptitude earned poorer scores than similar men 
in Class 1. Within either class, the correlation is high, and final scores can be 
predicted in either class with an error of five points or less. 
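
The effect of pooling differently treated groups can be reproduced with a few
invented numbers. In the sketch below every score is hypothetical: two classes
share the same aptitude scores, but the second class earns uniformly lower
achievement scores, standing in for poorer instruction.

```python
def pearson_r(xs, ys):
    """Product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical data: same aptitude scores in both classes.
APTITUDE = [40, 45, 50, 55, 60, 65, 70]
DEPARTURE = [2, -3, 1, -2, 3, -1, 0]  # small fixed departures from the trend

# Achievement = aptitude + class effect + departure.
CLASS_1 = [a + 20 + d for a, d in zip(APTITUDE, DEPARTURE)]
CLASS_2 = [a - 15 + d for a, d in zip(APTITUDE, DEPARTURE)]  # poorer teaching

if __name__ == "__main__":
    print(f"within Class 1: {pearson_r(APTITUDE, CLASS_1):.2f}")
    print(f"within Class 2: {pearson_r(APTITUDE, CLASS_2):.2f}")
    pooled = pearson_r(APTITUDE + APTITUDE, CLASS_1 + CLASS_2)
    print(f"both classes pooled: {pooled:.2f}")
```

The aptitude test ranks the men equally well in each class. The pooled
coefficient drops only because the class difference adds variation in
achievement that the test cannot account for.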

Restriction of Range 

It was mentioned in Chapter 4 that tests predict less accurately when they
are applied to a homogeneous group. Validity coefficients rise when a test is
applied to a group with a wide range of ability and drop when the test is
used on a restricted, preselected group. Many studies are based on selected
groups. Deemer,
to take the course. Probably many of low aptitude were not included, since
girls entering a shorthand course have normally already successfully
completed some work in typing. If Deemer’s test, then, were applied to an
entirely unscreened group, a higher coefficient would result. On the other hand,
it would yield a lower validity coefficient if applied to a single class, all of
whom had about equal intelligence.

The effect of screening upon validity coefficients is illustrated by the AAF
study referred to earlier. The validity coefficient of the battery for pilot
selection was in the neighborhood of .37 for men normally passing the screen.
When, for experimental purposes, a completely unscreened group was sent to
pilot training, the failure rate rose enormously. In this group, the validity
coefficient rose to .66 (8, pp. 103, 193).
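
The size of such a shift can be related to how sharply screening narrows the
spread of test scores. The sketch below assumes the standard correction for
direct selection on the predictor (often labeled Thorndike's Case 2), which
presupposes linear regression and equal error variance over the whole range.
The standard-deviation ratio it prints is solved from the two AAF coefficients
for illustration only and is not reported in the source.

```python
from math import sqrt


def unrestricted_validity(r_restricted, sd_ratio):
    """Estimate validity in the unscreened group.

    sd_ratio is the test's standard deviation in the unscreened group
    divided by its standard deviation in the screened group.
    """
    scaled = r_restricted * sd_ratio
    return scaled / sqrt(1.0 - r_restricted ** 2 + scaled ** 2)


def implied_sd_ratio(r_restricted, r_unrestricted):
    """SD ratio that would carry the restricted coefficient to the
    unrestricted one under the same correction formula."""
    numerator = r_unrestricted ** 2 * (1.0 - r_restricted ** 2)
    denominator = r_restricted ** 2 * (1.0 - r_unrestricted ** 2)
    return sqrt(numerator / denominator)


if __name__ == "__main__":
    ratio = implied_sd_ratio(0.37, 0.66)
    print(f"implied SD ratio: {ratio:.2f}")
    print(f"check: {unrestricted_validity(0.37, ratio):.2f}")
```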

Investigators are frequently perplexed when a variable listed in the job
analysis fails to predict the criterion of success. The job analysis may have
been correct in listing the ability as essential to the job, yet selection may
have reduced its significance as a predictor. If one is to continue to draw his
sample from a similarly selected group, this variable will not help in
prediction. But if prediction on a less selected group becomes important, the
variable which was once of no value may turn out to be a good predictor.
For example, intelligence tests have consistently been poor predictors of
success in teaching. The explanation is obvious: Nearly every teacher has
survived years of schooling with at least adequate grades, so that every teacher
has a fair to superior degree of intelligence (Fig. 68). Among those so
selected, differences in tested intelligence play little part in determining
success as teachers. Granted that an intelligence test will not help a school
system hire teachers, an intelligence test is still a major factor in advising a
girl in high school whether she is likely to be able to complete a teacher-



Fig. 67. Relation between classification test scores and achievement test
scores for two classes of torpedomen. (Hypothetical data, based on a study
reported by the Bureau of Naval Personnel [21, p. 308].)

[Fig. 68 scatter diagram not reproduced. Horizontal axis: Intelligence, Low to
High. Groups plotted: All Adolescents Who Might Want to Teach; Adolescents
Reaching a Teaching Position.]


Fig. 68. Hypothetical data illustrating the effect 
of preselection upon correlation. Dots show scores to 
be expected if every ninth-grader interested in teach- 
ing later enters the profession. Circles show scores of 
persons likely to survive gradual elimination as a 
consequence of low school marks. 


27. Preselection operates on some characteristics and not on others in any
situation. If one were considering the probable success in industrial jobs of
graduates from an engineering school, what characteristics would have a
restricted range owing to preselection? What characteristics would probably
not have been so restricted?

Contamination of Criteria

It is important to guard against contamination of criteria, which spuriously
raises correlations. Wherever ratings are used as criteria, there is a
possibility that teachers, foremen, or other judges are influenced by partial
knowledge of the prediction test scores. Teachers may be influenced in their
grading by knowledge of a pupil’s IQ. A foreman may rate a man higher than his
performance warrants, because he knows the man has considerable experience.
These influences raise the correlation between grade and intelligence or rating
and experience. The only way to eliminate contamination is to keep test data
secret until all criterion scores have been collected.


28. In each of the following situations, trace the way in which contamination might 
occur, and suggest an improved procedure to avoid it. 

a. A psychologist administers aptitude tests to entering college freshmen, and
from the results predicts each student’s success. Success is determined after
two years by noting which students have been dropped from school by
the school guidance committee for unsatisfactory work. The predictions
are kept in a locked file and not made available until the two years have
passed. The psychologist is a member of the committee, but did not
discuss the predictions.

b. Test data on pupils’ intelligence, mathematical ability, and other facts are
made available to science teachers so that they can do better teaching.
Learning in science, which was to be predicted, is judged not by ratings but
by an objective test of ability in science given at the end of the course.

c. Tests for selecting salesmen are being tried experimentally. Because they
are considered probably valid, the results are given to the sales manager for
his guidance in assigning territories to the salesmen in the experimental
group. After a year of trial, each man is judged by the amount of his sales
in relation to the normal amount for his territory.

d. Flight instructors’ ratings are used as a basis for promoting men from
primary to advanced training. It is desired to check the validity of these
ratings as predictors of success in advanced training. Advanced training is
taken at the same field, with a different instructor. This man’s judgment
supplies the criterion.

Necessity for Confirmation of Findings 
When an investigator has once obtained a satisfactory validity coefficient,
he tends to install his program and stop research. Other workers, reading his
report of the study, may accept his test as valid and put it to work in their
own situations. This is unsound. In the first place, any validation results are
influenced by chance, and correlations will fluctuate from sample to sample.
As a result, the test which proves best in one sample may prove not to be the
best predictor in another similar sample. Even when the results are based on
a large sample, the particular critical score or the particular weights most
effective in a multiple correlation are certain to change when a new group is
tested. If the same formula is applied in other groups, the correlation is sure
to drop. Moreover, the supply of men and conditions of training change from
time to time in any situation where prediction is attempted. It follows that
the investigator must redetermine the validity of his prediction technique
periodically.

This problem is illustrated by data selected from a study of the success of 
recruits in a Navy radio school (9). Instead of making only one study and re- 
lying on the resultant correlation, the investigators tested successive com- 
panies of men. It is apparent from Table 51 that different tests would have 
seemed valid in different companies, if any one had been used as the sole 
experimental group. The most striking difference is in the electricity test. The
experimenters traced the shift in correlation to the fact that Company A was
composed of recruits having some electrical experience, whereas Company B
included many men with little electrical background.

Table 51. Change of Validity Coefficients from Group to Group (9)

                                                Correlation with Grades
Company  Number  Test             Mean   S.D.   Mathematics  Radio Lab.
A          94    Mental ability    51    11.2       .50         .60
                 Mathematics       35    10.0       .68         .73
                 Electricity       95     6.8       .19         .38
B         100    Mental ability    53    10.6       .61         .40
                 Mathematics       34    11.3       .82         .54
                 Electricity       87    11.3       .25         .69
C         100    Mental ability    56    10.1       .41         .20
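
How far a validity coefficient may wander by chance alone can be gauged with
the Fisher z transformation. The sketch below assumes simple random samples
from a bivariate normal population. The two coefficients are the electricity
test's correlations with Radio Lab. grades as read from Table 51, and the
95 percent level is an arbitrary choice.

```python
from math import atanh, sqrt, tanh


def fisher_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for a correlation r based on n cases."""
    z = atanh(r)
    standard_error = 1.0 / sqrt(n - 3)
    return tanh(z - z_crit * standard_error), tanh(z + z_crit * standard_error)


def difference_z(r1, n1, r2, n2):
    """Normal deviate for the difference between two independent
    correlations; values beyond about 2 are unlikely to be chance alone."""
    standard_error = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (atanh(r2) - atanh(r1)) / standard_error


if __name__ == "__main__":
    # Electricity test vs. Radio Lab. grades, as read from Table 51.
    for company, r, n in (("A", 0.38, 94), ("B", 0.69, 100)):
        low, high = fisher_interval(r, n)
        print(f"Company {company}: r = {r:.2f}, interval {low:.2f} to {high:.2f}")
    print(f"difference z = {difference_z(0.38, 94, 0.69, 100):.2f}")
```

The intervals show how loosely a single company of about a hundred men fixes a
coefficient. A difference too large for chance, as here, points instead to a
real change in the group, such as the difference in electrical background that
the experimenters reported.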

If a battery selected for a given situation must be rechecked from time to
time, we surely cannot expect the battery used in one situation to work
without modification in a different situation. If the second situation is much
like the first, the tests useful in one place should prove valid in the other.
If the situations differ appreciably, it may be necessary to introduce new tests
or to change the weights assigned. In using the findings of research in another
situation, there is no substitute for confirmatory research in the new situation.

29. What selection or classification scheme would probably have been used if only 
Company A had been studied? If only Company B had been studied? 

30. Which shifts in correlation (Table 51) can be explained in terms of change in
the range of talent from group to group?

SINGLE-PURPOSE APTITUDE TESTS 

One by one, we have considered the various important types of ability 
tests. General mental ability, which was first considered, is a factor found to 
influence success in almost every job. Attention was next directed to tests in- 
tended to measure each of the widely important special abilities; they also 
are likely to be valuable in prediction in many situations. To complete the 
picture of ability tests available for predictive uses, it remains only to survey 
the tests designed for prognosis of specific single activities. Deemer’s steno- 
graphic test is an example of this type. Complex coordination tests also are 
useful only for predicting very limited performances. 

Table 52 contains a representative listing of the prognostic tests which
have been described in the literature. Tests have been designed for prognosis
in most secondary-school subjects and for many college subjects, in an
attempt to improve upon the correlations of .40-.55 that can be obtained using
general mental tests. Tests for vocational guidance and for selection among
applicants for training have sought to measure the abilities required for
particular occupations, especially at the professional level. Except in the
field of medicine, however, such tests have had limited usefulness. Several
tests were designed for selection of men for military specialties which cannot
be predicted by the general classification tests.

Table 52. Representative Prognostic Tests

Prognostic Tests for Academic Subjects:

  Test and publisher: Iowa Placement Examinations (Univ. Iowa)
  Content and special features: Mental operations and knowledge like that used
    in the course. Separate tests for foreign language, chemistry, etc.
  Validity coefficient (a): .50-.60 with marks in the individual course
  Comments of reviewers (b): “Higher prognostic validity for that purpose than
    general intelligence tests. . . . Distinctive advance in psychometric
    procedures” (16, pp. 228-229).

  Test and publisher: Lee Test of Algebraic Ability (Pub. Sch. Publ. Co.)
  Content and special features: Arithmetic, proportions, number series, formulas
  Validity coefficient (a): .71 with achievement test
  Comments of reviewers (b): “Undoubtedly of some value in detecting ability in
    algebra. . . . Would be much better if each subtest not quite so
    homogeneous . . .” (Wilks, 4, p. 280).

  Test and publisher: Symonds Foreign Language Prognosis (Teachers Coll.,
    Columbia Univ.)
  Content and special features: Knowledge of English grammar, exercises in
    learning Esperanto
  Validity coefficient (a): .60 with achievement tests
  Comments of reviewers (b): “Hardly more than a linguistically weighted
    intelligence test” (Kaulfers, 4, p. 157).

Tests for Professions and Vocations:

  Test and publisher: Engineering and Physical Science Aptitude (Psych. Corp.)
  Content and special features: Mathematics, spatial, mechanical comprehension,
    vocabulary, etc. Designed for war-industry trainees
  Validity coefficient (a): .73 with grades in war-emergency training
  Comments of reviewers (b): “Should be validated in terms of the situation in
    which it is to be employed. . . . User will find it necessary to prepare
    norms based on his local group” (Fredericksen, 5).

  Test and publisher: Stanford Scientific Aptitude (Stanford)
  Content and special features: Various types of judgment, reasoning,
    carefulness. Supposed to select research workers
  Validity coefficient (a): .74 with ratings of science students
  Comments of reviewers (b): “The test is not valid. . . . Zyve’s coefficients
    are almost certainly spurious” (16, p. 238. See also Crawford, 4, p. 453).

  Test and publisher: Medical Aptitude (Assn. Am. Med. Coll.)
  Content and special features: Verbal abilities, with emphasis on scientific
    content. Test not released. One annual nation-wide testing for prospective
    medical students
  Validity coefficient (a): .37 with freshman medical grades (c); higher r
    expected in unscreened group

  Test and publisher: Aptitude for Nursing (Geo. Washington Univ.)
  Content and special features: Information, reading, memory, verbal; nursing
    content
  Validity coefficient (a): .64 with first-quarter nursing grades (18, p. 245)

  Test and publisher: Detroit Clerical Aptitudes (Pub. Sch. Publ. Co.)
  Content and special features: Handwriting, checking, arithmetic,
    alphabetizing, etc.
  Validity coefficient (a): With marks: bookkeeping, .56; shorthand, .37;
    typing, .32
  Comments of reviewers (b): “Basically, the test measures factors common to
    intelligence tests and speed tests” (Lorge, 4, p. 434).

  Test and publisher: Iowa Legal Aptitude (U. Iowa)
  Content and special features: Verbal abilities with legal materials
  Validity coefficient (a): .42-.77 with law-school grades (1)

Other Prognosis Tests:

  Test and publisher: Tests for radio telegraphers (U.S. Army, unpub.)
  Content and special features: Discrimination of code patterns, learning of a
    few code characters
  Validity coefficient (a): .25, .33 with time required to attain code speed of
    16 w.p.m. (19)

  Test and publisher: Combat Information Center (tactical direction) Aptitude
    (U.S. Navy, unpub.)
  Content and special features: Translation of polar coordinates to grid, scale
    reading, relative movement
  Validity coefficient (a): .40-.55 with success in CIC training (21, page
    number illegible)

(a) From test manual unless otherwise indicated.
(b) The reader should refer to the source cited for a full statement of the
    reviewer’s opinion.
(c) Computed by tetrachoric method from data reported by Moss (15).

On the whole, standardized prognostic tests for special purposes have
declined in use. For guidance in a high school, it would be impossible to give
every pupil a test for every subject offered, and prognostic tests have had
validities only slightly better than those for more general tests. Most
counselors find that a test profile describing such abilities as verbal,
arithmetic reasoning, and mechanical comprehension is more useful than one
producing scores such as “aptitude for engineering” or “aptitude for
woodworking.” Prognostic tests geared to a particular subject or occupation
have an advantage when many people are to be considered for that one
performance, and where the performance to be predicted involves special factors
not measured in tests otherwise available. Therefore, prognostic tests help the
shorthand teacher identify which pupils are likely to need special instruction,
assist medical schools to select desirable entrants, and aid the Navy in
selecting radar operators. But a counselor, helping a young man choose a
vocation, will ordinarily map out his abilities with tests measuring widely
important factors, rather than employ test after test for single vocations.
Only in the final stages of guidance, when choice has narrowed to a few
vocations, are specialized prognosis tests efficient.

Prognostic tests for national use have one severe limitation, namely, that
the requirements for success in an activity may be different in different
situations. Perhaps Latin courses seem likely to be uniform enough so that a
test which predicts performance in Massachusetts would predict equally well in
California. But Kaulfers points out that the validity of a prognostic test in a
foreign language depends on the type of course offered, and that the available
tests are appropriate only for a now-obsolescent grammar-translation
type of course. He adds, “Foreign language prognosis tests of the
[conventional] type are usually excellent means for reducing foreign language
enrollments in non-functional courses taught by teachers incapable of
adjusting either method or content to the needs, interests, and abilities of
children” (4, p. 158).

Over and over, prognostic tests with general names have been found not to
cover validly all the situations covered by such a name. This has been most
clearly proved by the many studies of “aptitude for selling,” for which a
test would indeed be desirable if practical. Say Kornhauser and Schultz,
reviewing the literature (12): “It is abundantly clear from these and other
investigations that effective selection procedures must be worked out in
relation to the particular type of selling; there are at present no general
‘sales ability’ tests. Where psychologists have carried on thorough and
continuous research, however, to develop tools which may aid management in
selection of specific sales personnel, definite results of value have been
achieved.”

Since it is not practical here to summarize the many investigations of 
prognosis for various purposes, Table 53 has been prepared to draw the 
reader’s attention to sources where summaries of prognostic studies may be
found. 

Table 53. References Summarizing Research on Prognosis

Performance predicted: Clerical work
  Author of summary, and reference:
    Bills-Hammitt, Natl. Bus. Educ. Q., 1936, 4, No. 3, 26-30
    Osborne, Natl. Bus. Educ. Q., 1945, 13, No. 4, 19-28
  General findings: General mental tests helpful in selection of employees;
    shorthand demands special abilities not tapped by general tests.

Performance predicted: Dentistry
  Author of summary, and reference:
    Bellows, J. consult. Psychol., 1940, 4, 10-15
    (Anon.), Vet. Admin. Tech. Bull., 1947, TB 7-44
  General findings: Five essential areas: college aptitude, ability in science,
    interests, spatial ability, eye-hand coordination.

Performance predicted: Law
  Author of summary, and reference:
    Adams, Educ. psychol. Measmt., 1943, 3, 291-305; 1944, 4, 13-19
  General findings: Iowa Legal Aptitude Test plus prelaw grades yields
    validities as high as .77.

Performance predicted: Music
  Author of summary, and reference:
    Bienstock, J. educ. Psychol., 1942, 33, 427-442
  General findings: Results are not conclusive, and present methods are
    inadequate for predicting individual achievement. Tests are useful in a
    negative way, to prevent failures.

Performance predicted: Reading, beginning
  Author of summary, and reference:
    Robinson-Hall, Bull. Ohio Conf. Reading, 1942, No. 3
  General findings: Readiness tests are fairly valid, but no better than
    intelligence tests or ratings based on study of each child.

Performance predicted: College grades
  Author of summary, and reference:
    Segal, U.S. Off. Educ. Bull., 1934, No. 15
    Durflinger, J. Amer. Assn. Coll. Registrars, 1943, 19, 68-78
  General findings: Ability tests, achievement tests, and high-school marks
    all correlate .50-.55 with college marks. Multiple r’s reach .60-.70.

Performance predicted: School success, all levels
  Author of summary, and reference:
    Eurich-Cain, Encyclopaedia of Educ. Res., 1941, pp. 838-860

31. The aptitude tests in the Iowa Placement Examination require about 40
minutes each. The usual general ability tests for college freshmen require about
one hour. Discuss the way in which the Iowa tests could be most effectively
used, considering who should take the tests, and what judgments would be
based on the data.

32. Which of the prognostic tests in Table 52 would be expected to have high cor- 
relations with group tests of general intelligence? What does this imply regard- 
ing the usefulness of these tests? 

33. Do you agree with Kaulfers that courses should be tailored to student apti- 
tudes, rather than designed to fit a limited group of pupils who can be 
preselected? 


SUGGESTED READINGS 

Bellows, Roger M. “Procedures for evaluating vocational criteria.” J. appl.
Psychol., 1941, 25, 499-513.

Paterson, Donald G., and others. “The criterion of mechanical ability,” in
Minnesota mechanical ability tests. Minneapolis: Univ. of Minnesota Press, 1930,
pp. 144-202.

Stalnaker, John M. “Personnel placement in the armed forces.” J. appl. Psychol.,
1945, 29, 338-345.

Tiffin, Joseph. “General principles of employee testing,” in Industrial
psychology. New York: Prentice-Hall, 2nd ed., 1947, pp. 46-80.

Toops, Herbert A. “The criterion.” Educ. psychol. Measmt., 1944, 4, 271-297.

Van Dusen, A. C. “Importance of criteria in selection and training.” Educ.
psychol. Measmt., 1947, 7, 498-504.

CHAPTER 12


Measures of Achievement 


FUNCTIONS OF ACHIEVEMENT TESTS 

Achievement tests attempt to determine how much a person has learned 
from some educational experience. This chapter will demonstrate the variety 
of tests which have been devised and call attention to the many uses they
have over and above the traditional assignment of marks. Most achievement 
tests are prepared for local purposes, but some have been standardized for 
widespread use. Although standard achievement tests have real merits, they 
are best adapted to fairly standardized types of education. 

Perhaps the best way to see the value of carefully prepared, standardized 
tests of achievement is to examine their use in a somewhat unfamiliar situa- 
tion. The Navy has long operated, with notable success, an immense program 
of vocational education. Navy recruits receive training in a variety of skills 
needed to maintain a complex fighting force. When civilians with educational 
and psychological training entered the service during World War II and
were given responsibility for some Navy schools, one major change they 
proposed was the development of standard achievement tests for most serv- 
ice schools. The history of this program is given in a report by the Bureau of 
Naval Personnel (28, pp. 287-^4). The advantages the Navy found may
be summarized as follows: 

Tests aided in standardizing the instruction of all .schools preparing 
men for a given duty. Although objectives of instruction and curricula were 
standardized, individual schools had been deviating by neglecting or over- 
emphasizing certain topics. Any deviation would be reflected in scores on 
the test lower than in other schools. The tests t herefore forc ed teache rs to 
do as the course planners inten^d. 

2. The tests provided a basis for suggesting curriculum revision and 
improving instructional methods. If certain learnings were found not to be 
mastered in the time allotted, it was possible to reconsider the length of the 
course or the emphasis placed on these outcomes in the course. Without such 
tests, instructors often assumed that, having “covered” a topic, they had 
taught it as well as possible. Test results were of great interest to instructors 
and often caused them to seek help from supervisors or specialists to improve 
their teaching. The tests helped to make teachers self-critical. 

3. Some tests required that the student demonstrate job skills rather than 
merely give back verbal answers. Such tests directed the attention of 
instructors to the importance of teaching the behaviors the course was 
intended to produce. Reliance on the lecture-discussion method of teaching 
declined. A definite improvement in the effectiveness of training was found. 

4. In the absence of tests, grades had been assigned to students on the 
basis of subjective impressions, or, at best, with the aid of teacher-made 
tests. Such marks were unreliable, failed to give attention to all significant 
aspects of the course, and were subject to bias. Even objective tests, when 
made by teachers without special training, have generally been poor meas- 
uring instruments. Careful preparation of tests increased the probability 
that each man would be rated fairly and accurately. 

5. Tests which placed emphasis on all significant aspects of the course 
made sure that all-round achievement would be considered and deficiencies 
noted. In the Basic Engineering school, before standard testing, grades 
were strongly influenced by performance in mathematical aspects of the 
course, probably because this ability was easy to test. Examinations were 
altered to place emphasis on mechanical understanding. As a result, atten- 
tion was drawn to men who, despite skill in arithmetic, were poor in other 
essentials. 

6. Standard tests permitted analysis of grading standards and reduced 
variations due to generosity or severity of graders. When considered in 
relation to aptitude scores, achievement scores identified classes which failed 
to make adequate progress (Fig. 67). When marks were based solely on 
standard tests, grades given at different places represented the same degree 
of proficiency. 

7. Motivation of students and instructors was improved by developing 
rivalry based on a fair standard. Showing a man the particular ways in which 
he needs further skill is an excellent way to motivate study. 

8. More accurate final marks were a better criterion against which to 
validate proposed selection tests. Accurate marks were more useful in making 
subsequent administrative decisions about men. 

Nearly all the above purposes and advantages have their counterpart in 
public schools and colleges, because the same problems are present. Marks 
are unreliable, and may emphasize some aspects of the course to the exclu- 
sion of others. Different instructors in the same department grade differently 
and teach with different effectiveness. Teachers often depart from planned 
curricula. But to some extent, the outcomes which were virtues in military 
training are undesirable in general education. These difficulties will be dis- 
cussed in the next section. 

One significant contribution of standardized tests has been to break down 
the “time-serving” concept of education. While a person's standing in school 
is still usually judged by the number of years he has put in, or the number 
of courses he has passed through, this system is being increasingly criticized. 
In one study, thousands of college students took standardized tests of knowl- 
edge in various fields; it was found that many college seniors knew less than 
the average high-school senior (18). A person who has spent four years in 
high school or college may not have acquired an education during that time. 
Tests are therefore being given increasing weight as evidence of educational 
development. Following World War II, high-school diplomas have been 
issued to many veterans who did not complete high school but who demon- 
strated satisfactory educational development on standard tests. College 
credit was also allowed on the same basis, on the reasonable argument that 
a man who could do well on a standard test in English did not need to serve 
time in a freshman English course. This use of carefully prepared standard 
tests appears likely to increase. 

Achievement tests have clinical and predictive uses, apart from their more 
common function of appraising learning. Tests of achievement are often 
superior predictors of work in future courses. They may be used in hiring 
employees, examples being the usual civil service examination and the arith- 
metic tests used in stores and offices. They provide valuable insight into 
many clinical cases. A delinquent, for instance, may turn out to have ex- 
cellent achievement in school subjects even when teachers have given him 
low grades. Achievement tests are also useful in evaluating and diagnosing 
mental deterioration from organic or emotional causes, in making recom- 
mendations regarding the feeble-minded, and so on. 

DIFFICULTIES OF ACHIEVEMENT TESTING IN SCHOOLS 
Historical Background 

Examinations have been used always, but standardized tests are a com- 
paratively modern innovation. Fifty years ago teachers around the country 
taught in their own ways, set their own expectations of their pupils, and 
assigned grades independently. If the average fourth-grader in Mill Corners 
could outread sixth-graders down the road at Pinetown, that fact was never 
brought to light. Parents and employers looked at the performance of school 
graduates and, according to their dispositions, were pleased or displeased 
with the results. There was no sound basis for judging whether the school 
had taught as well as might reasonably be expected. The first systematic 
comparison of school attainment was made by an educational crusader, 
J. M. Rice, in 1897. Rice was convinced that the pressure for perfection in 
certain types of achievement was leading to faulty emphasis in education, 
and prepared a test of spelling ability to determine what results could be 
expected. His test was given in twenty-one scattered cities and showed that, 
regardless of the time devoted to spelling, the test scores of eighth-graders 
were about the same in all cities. He showed further that although children 
in some cities were superior in spelling during early grades, presumably 
because of stress on that subject, such differences vanished by the end of 
schooling (25). It is ironical that Rice hoped by such evidence to show that 
the time spent on formal learning could be reduced, saving more time for 
an enriched curriculum. Instead, the testing movement which he fathered 
became a factor tending to hold the schools to limited curricula. 

Educators were quickly impressed with the advantages of determining 
when schools were “up to standard,” and tests of reading and arithmetic 
were prepared and widely used. Tests in other subjects followed rapidly. 
At the height of the enthusiasm for standard tests, in some cities every 
pupil was given a nationally distributed examination each June in nearly 
every course he studied. 

Despite the marked benefits conferred by tests, similar to those discovered 
in the Navy program outlined above, the testing craze eventually produced 
serious dislocations in the school program. The tests became a taskmaster 
which everyone in the school found himself trying to serve. When good 
standing on the final test in June became the goal of the entire year's work, 
useful classroom activities which would not raise test scores received scant 
encouragement. The harmful influence on the school program of tests used 
unwisely will be outlined in detail in the following sections. Because of 
increased recognition of these problems, there has been a swing away from 
mandatory testing since about 1930. Standardized tests are still used in most 
school systems, but they are used in moderation. In sound practice, evalua- 
tion of a pupil or a teacher is never based on these tests alone; instead, the 
tests are treated as one source of data to be linked with many other facts 
in making a final evaluation. Tests made by the classroom teacher to fit his 
class are now regarded as an indispensable and valid type of measurement; 
at one point in the not-distant past, they were regarded as inferior substi- 
tutes for the standard test, then considered to be the “true” yardstick of 
attainment. 


Problems of Validity 

Sampling of Content 

Standard tests have not always given valid measures of the content areas 
they dealt with. If a test purports to show how much Theodore has learned 
in French, but asks questions involving French words Theodore has not 
studied, the resulting measure may be quite unfair to him. The necessity of 
fairness is nowhere greater than in standardized tests, since these tests are 
used to compare schools using different textbooks and covering somewhat 
different material. 

This has led to considerable attention to problems of sampling in achieve- 
ment tests. The only fair test of factual knowledge in an entire area is one 
which weights equally all significant topics in the area. A history test which 
emphasizes political and military history is fair only if that is the sort of 
history the school desires to teach. A test can present only a sample of the 
items taught in spelling, history, arithmetic, or grammar. When a sample is 
taken, chance in sampling is likely to give some pupils or schools an ad- 
vantage through the inclusion of items on which they happen to be superior. 
One of the major advantages of the objective test over the essay test is that, 
by including more questions, the objective test reduces the probability that 
a poor student will earn a high score by a lucky break in the selection of 
questions. 
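
The effect of test length on a "lucky break" can be illustrated with a small calculation. The sketch below is not taken from any study cited here; it assumes, for illustration only, a pupil who knows half of the field, items drawn independently at random (a binomial model), and a "high score" defined as 70 percent correct. The function name and all figures are hypothetical.

```python
from math import comb


def prob_at_least(n_items: int, need: int, p_know: float) -> float:
    """Chance of answering at least `need` of `n_items` items correctly when
    each sampled item has probability `p_know` of being one the pupil knows."""
    return sum(
        comb(n_items, k) * p_know**k * (1 - p_know) ** (n_items - k)
        for k in range(need, n_items + 1)
    )


# Hypothetical pupil who knows half of the field; "high score" = 70 percent.
short_test = prob_at_least(n_items=10, need=7, p_know=0.5)
long_test = prob_at_least(n_items=100, need=70, p_know=0.5)

print(f"10-item test:  {short_test:.3f}")   # 0.172
print(f"100-item test: {long_test:.6f}")    # far below one chance in a thousand
```

Under these assumptions, about one such pupil in six reaches 70 percent on the ten-item test by luck alone, while on the hundred-item test the chance is negligible.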

Sampling was seriously faulty in early achievement tests. Often they tested 
spelling words someone believed pupils “ought” to know, rather than words 
actually being taught in school. A test in history, based on facts chosen from 
only one textbook, would apparently measure a significant sample of historical 
knowledge but in truth would be a most unfair test. To guard against invalid- 
ity due to poor sampling, recent tests of knowledge and skill are based on 
careful consideration of curricula. The test designer looks at many courses of 
study to decide what topics to cover. A sampling plan is used to draw items 
equally from all significant topics. Such a test will still give a low score to a 
class which has placed its entire emphasis on one or two topics and mastered 
them thoroughly. But these students do have a low mastery of the total field 
as taught in most schools. 
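
A sampling plan of the kind just described can be sketched in a few lines. The example is only an illustration: the topic names, the size of the item pool, and the function `draw_test_items` are hypothetical, and a real test designer would also weigh item difficulty and quality, which this sketch ignores.

```python
import random


def draw_test_items(item_pool, items_per_topic, seed=0):
    """Draw the same number of items at random from every topic in the pool."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    chosen = []
    for topic in sorted(item_pool):
        for item in rng.sample(item_pool[topic], items_per_topic):
            chosen.append((topic, item))
    return chosen


# Hypothetical pool: topic names and item labels are placeholders.
item_pool = {
    "political": [f"political-{i}" for i in range(1, 41)],
    "military": [f"military-{i}" for i in range(1, 26)],
    "economic": [f"economic-{i}" for i in range(1, 31)],
    "social": [f"social-{i}" for i in range(1, 21)],
}

test_items = draw_test_items(item_pool, items_per_topic=5)
counts = {
    topic: sum(1 for t, _ in test_items if t == topic) for topic in item_pool
}
print(counts)  # every topic contributes the same number of items
```

Because each topic contributes equally regardless of how many items were written for it, a class that studied only one or two topics can answer only a small share of the test, which is the outcome described above.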

Curricular Validity 

In one sense, a test which is a fair sample of the content of a field is a 
valid test. In another sense, it may be quite invalid. When a test is to be 
used to judge the effectiveness of a learning experience, it is important that 
it have curricular validity. A test has curricular validity if it measures fairly 
the extent to which pupils have learned what the curriculum was intended 
to teach. The designer of a standard test cannot adapt his test to local 
needs. But teachers must often adapt their courses to fit the problems in a 
particular school or community. Different knowledges in science, for ex- 
ample, are important for pupils who will later become Ford factory workers, 
Minnesota dairy farmers, students at M.I.T., or housewives. If each of 
these groups studied a different science curriculum and mastered it, some 
groups would do badly on a standard test of science knowledge because it 
would ask what they had not studied. A test which does not fit the course 
gives a misleading answer to such common questions as “How effective 
has the teaching been?” and “Which pupils have failed to learn what has 
been presented?” 

Measurement of Appropriate Behaviors 

It is fallacious to measure all achievement in terms of factual knowledge 
and a limited group of skills. If you ask a teacher why he is teaching his 
course, he will list a large number of objectives. The geometry teacher, 
for example, would deny that the purpose of his course is to transmit a 
certain number of theorems, a few practical principles, and some skills with 
ruler and compass. Instead, he would speak of developing habits of reason- 
ing, skill in identifying assumptions, skeptical attitudes regarding unveri- 
fied conclusions, and so on. Yet in most subjects traditional achievement 
tests have stressed specific facts and skills to the exclusion of the more im- 
portant outcomes of the course. In a series of studies of college and high- 
school teaching, Tyler and his co-workers identified a large number of pur- 
poses schools claimed to hold. These aims may be grouped as follows 
(adapted from 24, 26): 

1. Functional information. Not mere rote knowledge, but knowledge 
that can be applied to new situations where it is relevant. 

2. Thinking skills and habits. 

3. Attitudes and social sensitivity. (Tolerance, scientific spirit of 
inquiry, appreciation of music, etc.) 

4. Interests, aims, purposes. (Vocational goals, wise use of leisure, etc.) 

5. Study skills and work habits. 

6. Social and personal adjustment. 

7. Creativeness. 

8. Physical health. 

9. A functional philosophy of life. 

In order to obtain evidence of pupil growth in these qualities, it is neces- 
sary to define them in terms of behavior. How does a person act who has 
ability to draw sound conclusions from scientific data? How can we rec- 
ognize pupils who have “a functional philosophy of life”? Some objectives 
are easier to translate into behavior terms than others, but when such a 
translation has been made, it is possible to gather data, by tests or observa- 
tions, to show changes in pupils. “Skill in interpretation of data,” for example, 
is shown by certain definite actions. If we give a skilled person a table or 
graph related to a topic he has not studied, he does certain things. He identi- 
fies major trends. He disregards fluctuations that are probably due to varia- 
tions in sampling. He reports that factor A and factor B change together, 
not that factor A causes factor B. These and other interpretation skills can 
readily be tested, because we have defined precisely what actions give 
evidence that the subject possesses the desired skill. 

1. For one of the following courses, try to list all the important objectives. 
   a. Study of literature in the junior high school. 
   b. A course to train union officials for collective bargaining. 
   c. A course to train junior executives in human relations. 

2. Would the purpose of a course in “European history since 1815” be the same in 
   an American and a Russian university? Could the same test validly be used for 
   both courses? 

3. Define each of the following objectives in terms of specific behaviors. 
   a. “To train young people for wise parenthood.” 
   b. “To increase appreciation of good literature.” 
   c. “To prepare young people for the duties of citizenship.” 





Facts and skills loom so large in the usual classroom that teachers and 
test designers have often emphasized them out of proportion to other types 
of outcome. Although it is important to measure gains in knowledge and skill, 
a pupil may earn a high score on a test covering only memory-type material 
and yet have made little progress toward understanding of the course. A 
course in cooking presumably is supposed to improve ability to cook. But 
when one teacher tested a college class on their knowledge of scientific prin- 
ciples underlying cookery, and also had them cook food, the quality of cook- 
ing correlated only .25 with the verbal knowledge (3, p. 60). In 14 college 
courses Tyler correlated tests supposed to measure different types of learn- 
ing outcomes. The correlations (corrected for errors of measurement) were 
small: knowledge of facts vs. application of principles, about .45; knowledge 
vs. inference, .35; application vs. inference, .40 (17). 

Studies of forgetting give weight to the argument that changes in think- 
ing and attitudes, as well as knowledge, should be measured. These studies 
show that facts which are little understood are quickly dropped from the 
mind. Attitudes and changes of thinking habits are usually found to be much 
more lasting. Tyler gave a zoology test to one college class before they 
studied zoology, at the end of the course, and again after another year in 
which they studied no zoology. The results in Table 54 indicate that the 
lasting changes were primarily in ability to apply principles to new problems 
and to draw conclusions from data. 
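
The last column of Table 54 can be checked approximately from the three mean scores. The sketch below assumes that "percent of gain lost" means the drop from end-of-course to one year later, divided by the gain from beginning to end of course; that definition is an inference from the table heading, not a formula given in the source. Because the published means are rounded, the recomputed values agree with the published column only to within about two points.

```python
# (beginning, end of course, one year later, published percent lost), Table 54
table_54 = {
    "Naming animal structures": (22, 62, 31, 77),
    "Identifying technical terms": (20, 83, 67, 26),
    "Recalling: structures performing functions": (13, 39, 34, 21),
    "Recalling: other facts": (21, 63, 54, 21),
    "Applying principles to new situations": (35, 65, 65, 0),
    "Interpreting new experiments": (30, 57, 64, 25),  # published as a gain
    "Average, all exercises": (24, 74, 63, 22),
}


def percent_of_gain_lost(beginning, end, later):
    """Share of the gain made during the course that was lost a year later.
    A negative value means the score kept rising after the course."""
    gain = end - beginning
    lost = end - later
    return 100 * lost / gain


recomputed = {
    name: percent_of_gain_lost(b, e, later)
    for name, (b, e, later, _) in table_54.items()
}
for name, value in recomputed.items():
    print(f"{name:45s} {value:6.1f}  (published: {table_54[name][3]})")
```

The negative value for "Interpreting new experiments" corresponds to the table's footnoted entry, which reports a further gain rather than a loss.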

4. In a certain course, the correlation between a factual test and a test of ability to 
apply knowledge to new situations is .40. Assume that grades are assigned as 
follows: 10 percent A, 20 percent B, 40 percent C, 20 percent D, and 10 per- 
cent F. What grades will A students on the first test receive if the second test is 
used as a basis for grading? (Use the scatter diagram on p. 37.) 


Table 54. Improvement in Abilities in Zoology, Measured at the End of the Course 
and One Year Later (after 33, p. 76) 

                                                            Mean Score              Percent of Gain
                                                   Beginning   End of    One Year   During Course
                                                   of Course   Course    Later      Which Was
Type of Examination Exercise                                                        Later Lost

Naming animal structures pictured in diagrams         22         62         31          77
Identifying technical terms                           20         83         67          26
Recalling information
  a. Structures performing functions in type forms    13         39         34          21
  b. Other facts                                      21         63         54          21
Applying principles to new situations                 35         65         65           0
Interpreting new experiments                          30         57         64          25*
Average, all exercises                                24         74         63          22

* Gain over scores at end of course. 


Testers have sometimes confused ability (maximum performance) with 
typical performance. Many students who pass examinations by showing 
that they know what should be done may not do the right thing when the 
problem arises in life. It is easy, for example, to prepare a true-false test 
for a course in “how to study.” After exposure to a few lectures on principles 
of study, most students know what good study habits they should have and 
can answer the test. But the gap between what students know about study 
and what they do about it is large. The need for testing typical behavior is 
great, yet most achievement tests have stressed abilities produced on de- 
mand. Tests of typical behavior are needed to evaluate the effectiveness of 
courses teaching handwriting, leadership or personnel management, resis- 
tance to propaganda, accuracy in arithmetic, and many other objectives. 
Some devices for testing typical behavior are described in Part III. 

Some tests of achievement permit irrelevant variables to affect the score. 
Reading ability affects scores on almost all achievement tests. This is diffi- 
cult to avoid, but a valid measure of knowledge is not obtained if a person 
who knows a fact misses an item about it because of verbal difficulties. 
Sometimes the verbal factors play a large part in the test. The Navy 
Mechanical Knowledge Test contained four types of item: mechanical facts, 
tested verbally; mechanical facts, tested pictorially; electrical facts, tested 
verbally; and electrical facts, tested pictorially. The intercorrelations in 
Table 55 show that in this case similarity of content produced lower correla- 
tions than similarity in form. In other words, the form of the items largely 
determined the score received. 


Table 55. Correlations of Tests Having Similar Form, and Tests Having 
Similar Content (Navy Mechanical Knowledge Test) (8) 

                                                              Correlation
                                                              Corrected for
                                              Correlation     Unreliability

Tests similar in form, different in content:
  Verbal tests: mechanical vs. electrical         .63             .79
  Pictorial tests: mechanical vs. electrical      .64             .86
Tests similar in content, different in form:
  Mechanical: verbal vs. pictorial                .61             .71
  Electrical: verbal vs. pictorial                .51             .74
Tests different in both form and content:
  Mechanical verbal vs. electrical pictorial      .49             .63
  Electrical verbal vs. mechanical pictorial      .45             .59
Kuder-Richardson reliability coefficients:
  Mechanical verbal                               .89
  Mechanical pictorial                            .82
  Electrical verbal                               .71
  Electrical pictorial                            .67


Another study made at a naval training 
center provides even stronger evidence that the verbal element in tests is 
frequently undesirable. Training of gunners had been validly evaluated 
by scores made in operating the guns. To obtain a more economical substi- 
tute, verbal and pictorial tests were developed. Identical information was 
tested in the two forms, the same question being asked in words alone or by 
means of pictures supplemented by words. Questions dealt with parts of the 
gun, duties of the crew, appearance of tracers when the gun was properly 
aimed, etc. The pictorial test had a correlation of .90 with instructors’ marks 
based on gun operation whereas the validity of the verbal test was only .62. 
The verbal test was in large measure a reading test; it correlated .59 with a 
Navy reading test, while the picture test correlated only .26 with reading 
(30). Obviously, scores on verbal achievement tests should be viewed with 
suspicion for pupils who are poor readers. 
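
The column "Correlation Corrected for Unreliability" in Table 55 is consistent with the usual correction for attenuation, in which an observed correlation is divided by the square root of the product of the two reliability coefficients. The source does not state the formula used, so the sketch below is an assumption; it simply applies that correction to the observed correlations and Kuder-Richardson coefficients printed in the table and compares the results with the published column. The function and variable names are illustrative.

```python
from math import sqrt

# Kuder-Richardson reliabilities from Table 55
reliability = {
    "mechanical verbal": 0.89,
    "mechanical pictorial": 0.82,
    "electrical verbal": 0.71,
    "electrical pictorial": 0.67,
}

# (test A, test B, observed correlation, published corrected correlation)
pairs = [
    ("mechanical verbal", "electrical verbal", 0.63, 0.79),
    ("mechanical pictorial", "electrical pictorial", 0.64, 0.86),
    ("mechanical verbal", "mechanical pictorial", 0.61, 0.71),
    ("electrical verbal", "electrical pictorial", 0.51, 0.74),
    ("mechanical verbal", "electrical pictorial", 0.49, 0.63),
    ("electrical verbal", "mechanical pictorial", 0.45, 0.59),
]


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the correlation between two tests if both were perfectly
    reliable, given the observed correlation and each test's reliability."""
    return r_xy / sqrt(r_xx * r_yy)


corrected = [
    correct_for_attenuation(r, reliability[a], reliability[b])
    for a, b, r, _ in pairs
]
for (a, b, r, published), value in zip(pairs, corrected):
    print(f"{a} vs. {b}: {r:.2f} -> {value:.2f} (published {published:.2f})")
```

With the figures as printed, each recomputed value rounds to the published one, which supports reading the column as a correction for attenuation.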

Another variable which reduces validity of achievement tests is the speed 
factor. Time limits are often needed for administrative convenience, but 
when speed becomes a major element in determining a person's score, the 
score is likely not to represent his attainment accurately. Speed is a legiti- 
mate element in achievement tests only when speed is an objective of the 
course. Speed is relevant and important in tests of typing attainment or 
reading facility, or tests of arithmetic for use in hiring cashiers. Speed is 
irrelevant if we wish to know how large a pupil's vocabulary is, how much 
science he knows, or how accurately he can reason. In most achievement 
tests a speed loading can be justified only if the test is to be used empirically 
to predict success in a task where speed is helpful, or if data are available to 
prove that scores on the speeded test correlate very highly with scores on 
the unspeeded test. Speed tests have limited validity for describing the 
knowledge of individual pupils, since a few well-informed pupils are slow 
workers. 

Recognition and Recall 

Because only objective tests can be depended upon to compare pupils 
tested in different places, multiple-choice and other recognition items are 
necessarily given great emphasis in standard testing. This has been a source 
of concern to many teachers who felt that only tests requiring free responses 
could measure adequately what they were teaching. Recognition of right 
answers is easily tested by objective items, but where the purpose of teaching 
is to produce ability to recall or invent new solutions, teachers tend to prefer 
free-response tests. The English teacher often prefers to have her students 
judged on a sample of their free writing, rather than on tests of the copy- 
reading type where the student merely identifies errors. The mathematics 
teacher has felt that his students should be required to solve problems, rather 
than merely to select alternatives in what one writer calls “place-your-bet” 
questions. 

Considerable research has been done to determine if objective tests are 
adequate. The findings have been consistently favorable to the use of well- 
constructed objective tests. When a multiple-choice test is made up of cor- 
rect answers, together with incorrect alternatives chosen from among the 
wrong answers given by students who have answered the same questions 
with free response, the multiple-choice test has a high correlation with the 
free-response test. One would think that ability to generalize from data could 
be tested only by requiring that the student form his own generalizations. But 
a test carefully prepared by Tyler requiring undergraduates to identify the 
best and poorest generalizations from a set of data correlated .85 with ability 
to draw generalizations directly from the data. Ability to plan an experiment 
is a creative function; yet a recognition test calling for choice among alterna- 
tive plans correlated .79 with ability to make plans without suggestions (33, 
pp. 27-30). Tharp found correlations as high as .84 between ability to pro- 
nounce French, tested by a sample of the student’s speech, and score on a 
recognition test (29). Other studies have shown that recognition tests in 
writing and mathematics are valid tests (cf. p. 73). 

There are probably many places where only a free-response test is a good 
measure. If Tyler’s study of ability to plan experiments were repeated with a 
group of graduate students, all of whom had overlearned the verbally stated 
principles of scientific method, probably all would do well on any reasonable 
objective test. But even among such students there is marked variation in 
inventiveness in attacking new problems. Similarly, Tharp’s test sorted stu- 
dents accurately on French pronunciation. But it is doubtful if, in an ad- 
vanced group, fine discrimination between those with authentic and those 
with false accents could be obtained save by a test of performance. While 
good objective tests are dependable for surveys of most abilities and for 
marking, free-response tests are also widely used. Because these are difficult 
to score, they can only be standardized by the adoption of uniform scoring 
guides. 

5. Distinguish, in each of the following types of learning, the relative importance 
of recall or invention, and recognition. 

a. Learning to interpret children’s problem behavior in terms of probable 
causes. 

b. Learning to play bridge. 

6. Discuss the following point of view: 

“I view with some misgivings the purely utilitarian course in ‘Communica- 
tions’ which has been substituted for the traditional freshman composition 
course at S — . Students’ needs in this course, we are told, are ascertained by 
the administration of ‘batteries of tests.’ I venture to assert that nothing will be 
learned from these tests which a skilled teacher would not find out from a 
single theme and a half-hour interview; and that these would be better for 
the student psychologically, as motivation for the course, than the ‘batteries of 
tests’ ” (11). 

Technical Accuracy 

Although it should be obvious that a test is invalid if the keyed answers 
are incorrect, some published tests contain a serious proportion of errors. 
These errors result from inaccurate editing, ambiguities, and oversimplifica- 
tion of technical problems to fit the objective-item form. Diamond made a 
thorough criticism of 16 published tests in science. While some were com- 
pletely free from error, in other tests the number of errors of fact or theory 
rose as high as 9 per 50 items (9). 

Meaningfulness of Grade Norms 

Educational tests have commonly been standardized in terms of grade 
norms. A pupil’s raw score is translated into a “grade equivalent,” so that the 
standing of a superior sixth-grader may read as follows: 


    Reading       9.7
    History       9.2
    Math          9.0
    Geography     8.6
    English       8.7
    Spelling      7.7
    Literature   10.3
    Total         9.1


This means that, for example, his ability in mathematics equals that of the 
average beginning ninth-grader. Such reports are misleading unless the user 
is aware of the pitfalls in grade norms. The difficulties are three in number: 

1. Grade norms are based on more or less representative samples of pupils 
throughout the nation. Some sections of the country are far superior to others, 
owing to differences in pupil ability, differences in the quality of teachers, 
and differences in expenditures for education and the resulting program. No 
teacher or superintendent from a superior school can take pride if his group 
merely reaches the national norms; no one from a handicapped school dis- 
trict should be condemned if his group cannot attain the national average. 
The only fair basis for comparing schools is to judge each school against 
schools of similar type and resources. Rarely are published norms based on 
such meaningful segments as “New England public elementary schools, in 
cities with population 2000 to 10,000” or “Southern rural elementary 
schools.” 

2. Norms are not “standards.” It is a common mistake to assume that all 
pupils in the ninth grade should reach the ninth-grade norm. This is of course 
a fallacy; in the standardizing sample, 50 percent of the pupils fall below the 
norm. Furthermore, the test shows only what schools are doing at present. It 
is highly unlikely that the schools are doing so well that the national average 
represents what pupils could attain with the best teaching methods. The 
teacher whose class reaches the average has no cause for complacency. There 
is much room for the development of better educational methods. 

3. Grade norms are based on grossly unequal and artificial units of meas- 
urement. It looks as if the pupil is greatly superior who reaches the “ninth- 
grade level” in geography when he is only in Grade VI. But in many standard 
tests this difference in score represents very little gain, because the average 
pupil increases his ability only slightly from Grade VI to Grade IX. In other 
tests, this difference in “grade level” is merely an arbitrary assumption based 
on extrapolation rather than upon testing of pupils in higher grades. 

“Ninth-grade levels” in different subjects are not equally hard for sixth- 
graders to reach. The pupil who is “two years beyond his grade” in a subject 
may sometimes be markedly superior; at other times this represents a stand- 
ing equaled by a large proportion of his class. Moreover, a two-grade superi- 
ority in one subject is not at all equivalent to a two-grade gain in another. 
The profile of grade scores may give a very different picture of strengths and 
weaknesses than a profile of scores comparing the pupil to other sixth- 
graders. 

As in the case of the IQ (see p. 135), it is probably wise to replace grade 
norms with derived scores of the standard-score type. Some standard tests 
now give norms in the form of percentile scores in each grade, the Progres- 
sive Achievement Tests being an example. Percentiles within a single grade 
group do not permit a comparison of pupils in different grades, but this is 
rarely needed. 
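
The two kinds of derived score mentioned here can be illustrated briefly. The sketch below uses a hypothetical set of ten raw scores for one grade; the data and function names are invented for illustration, and the percentile rank counts half of any tied scores as falling below, which is one common convention rather than the rule of any particular published test.

```python
from statistics import mean, pstdev


def percentile_rank(score, grade_scores):
    """Percent of the grade group scoring below `score`, counting half of ties."""
    below = sum(1 for s in grade_scores if s < score)
    tied = sum(1 for s in grade_scores if s == score)
    return 100 * (below + 0.5 * tied) / len(grade_scores)


def standard_score(score, grade_scores):
    """Distance of `score` from the grade mean, in standard-deviation units."""
    return (score - mean(grade_scores)) / pstdev(grade_scores)


# Hypothetical raw scores of ten sixth-graders on one test
sixth_grade = [31, 35, 38, 40, 42, 42, 45, 47, 50, 58]

print(percentile_rank(50, sixth_grade))           # 85.0
print(round(standard_score(50, sixth_grade), 2))  # positive: above the grade mean
```

Both scores locate the pupil among others in his own grade, so a profile built from them compares like with like across subjects, which grade equivalents do not.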

Harmful Effects in the Educational Program 

It has been mentioned that the very powers sought in standardized testing
have been a two-edged weapon, damaging the educational program as much
as helping it. The essential difficulty is that achievement tests determine
the curriculum. Both college students and younger pupils devote much of
their time to trying to determine what will be tested, so that they can study
that material and neglect anything else. Students use different study
methods, concentrate on different materials, and have different
accomplishment when objective tests replace essay tests (22). The teacher
too is driven by the tests to direct his energies into the paths they dictate.
A chemistry teacher considering the modification of his course to stress
applications and reasoning
rather than technical formulas is deterred if he fears that his class, and hence 
his teaching, will “look bad” when a standard test at the end of the year asks 
about the formulas. The writer recalls visiting a rural school which was 
alarmed because all the boys, upon finishing the compulsory eighth grade,
immediately left school to work on the farm. The principal believed they 
should stay in high school, but the boys considered school a waste of time. 
A look at the “literature book” for Grade VIII supplied one clue to the diffi- 
culty. Pupils were being held to selections about a Hindu boy and his vil- 
lage, mountain climbing in Tibet, and other topics of remote interest. When 
the teacher was asked why she did not encourage the boys to develop their 
language skills on bulletins from the agriculture extension service and other 
materials that the boys would consider valuable, her answer was ready: “I 
know this book isn’t good, and the boys don’t like it, but I have to teach it 
because it prepares pupils on topics covered in the standard test given at the 
end of the year by the — Office” (an agency supervising the schools). While
hesitancy in this situation related primarily to desire to avoid criticism, in
other cases tests have directly forced schools to follow rigid curricula laid
down from above. This was considered essential in Navy training where
every man must serve in the highly standardized fleet, but in general educa- 
tion it is deplorable. Many schools have in the past been unable to modify 
their curricula because graduates were forced to pass traditional college
entrance examinations if they wished to go to college. Such tests have been
liberalized in recent years to permit more flexibility in lower-school curricula. 
Tests encourage sound teaching and sound study attitudes only when they 
measure all the behaviors the teacher wishes to develop, and nothing else. 

When tests have been used to determine whom in the group to find fault 
with, they have bred considerable ill will. A whip motivates, but it builds 
tension. Teachers have, in some cases, been discharged because standard 
tests made their work look poor. Promotion of pupils from grade to grade has 
sometimes rested on the standard test, with no consideration given to the
remainder of the school record. The administrator who introduces tests can 
make them an ally of teacher and pupil, a way for both to determine how to 
improve themselves. But if he is not careful, he will arouse anxieties which 
reduce the morale and effectiveness of the organization. 

A common error in applying standard tests has been to employ the norms 
without considering what might reasonably be expected in terms of pupil 
ability. Since intelligence varies from person to person and group to group, 
one can only judge whether progress is adequate by comparing the person
or group with others of similar potentiality. A common method for making
this comparison is the accomplishment quotient (AQ). The AQ is the ratio
of “educational age” to mental age, multiplied by 100. Alice, a nine-year-old
with MA 12, reaches the achievement norm for eleven-year-olds. Her AQ is 92
(100 × 11 ÷ 12). This is an ingenious procedure for identifying those who are
“underachieving” and “overachieving,” but it has led to so many unreasonable
results that the AQ is in growing disrepute. The weaknesses arise from the
limitations of the norms and from unsound statistical assumptions (13,
p. 251). Alice, for example, can scarcely be expected to be three years ahead
of her grade in history if she hasn’t been exposed to sixth-grade history, so
she is no underachiever. Because of these limitations the accomplishment
quotient should be discarded and replaced by a comparison showing how the
pupil’s learning compares with that of other pupils of similar aptitude or
initial ability. Such comparisons are usually based on local norms, but
national norms of this type can be prepared. The procedure by which any
teacher can make such an analysis is described by Trimble and Cronbach (32).
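
The arithmetic of the AQ can be made explicit in a few lines. This is a minimal sketch: the function name is hypothetical, and the multiplication by 100 and rounding to a whole number are assumptions inferred from the reported value of 92 for Alice.

```python
def accomplishment_quotient(educational_age, mental_age):
    """AQ as 100 times the ratio of educational age to mental age.

    Rounding to the nearest whole number is an assumption based on
    the worked example in the text.
    """
    return round(100 * educational_age / mental_age)


# Alice: achievement at the eleven-year norm, mental age 12.
print(accomplishment_quotient(11, 12))  # 92
```

Alice is nine years old and achieves at the eleven-year level, yet the quotient falls below 100 and labels her an underachiever. This is the kind of unreasonable result the text describes.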

7. What effects, good or bad, would the practices described in the following
paragraph have on the education of the pupils?

“The New York State Education Department, better known as the Re- 
gents, administers uniform examinations, also better known as the Regents, 
semi-annually to all high-school students pursuing key subjects. To prepare 
their students for this ordeal, many teachers abandon the regular textbook in 
favor of a special booklet containing a review of the subject and a reprint of 
recent Regents’ examinations. The general practice is to begin the review 
about four to six weeks in advance of the big test, although some teachers 
start Regents preparation as early as the first day of the term” (2). 

8. What effect would the above practices have on the validity of comparisons of
different schools on the Regents’ tests?

9. What arguments can be advanced for and against the policy of a county school
superintendent who gives a standardized battery of achievement tests in every
elementary school under her jurisdiction, and reports the standing of each
school in each subject to the local newspapers?

10. Why was each of the difficulties discussed in the foregoing section less signifi- 
cant in the Navy than in public school testing? 

11. In prewar Japan and Germany, the schools were held to curricula rigidly laid 
down in the national education ministries. It has been the policy of the occupa- 
tion in both countries to encourage local initiative in revising courses of study 
and trying new methods of teaching. The purpose is to eliminate stereotypy 
and thought control, to encourage independent thinking of pupils, and to de- 
velop programs adapted to local needs and interests as well as national policies. 
Some leaders propose that psychologists and educators develop standardized 
tests for important types of learning, to be made available to schools through 
the national and regional ministries. The aim of such tests would be to promote 
a high standard of teaching. Discuss whether the proposal should be adopted.

12. "As a general rule, no achievement test printed or revised more than five years 
ago, or any other test more than ten years old, should be used” (34, p. 26). 
Why do you think a writer on educational evaluation would make this state- 
ment? Do you agree? 

TYPES OF ACHIEVEMENT TESTS 

Among the myriad standard achievement tests, a few broad types appear.
Some tests measure achievement in a single subject, such as history or
science. Batteries of achievement tests have also been prepared, especially for
the elementary grades. Such batteries consist of several subtests, each
measuring an area of learning. Since the subtests are standardized together, the
test tells how a child’s knowledge in one subject compares with his ability in
other areas. The areas most commonly measured at the elementary level are
reading, spelling, grammar, history-civics, and the like (see Table 56). Tests
for high schools and colleges have also been prepared to give a picture of
growth in all areas, but on the whole, single-subject tests have been more
widely used than batteries of subject tests.

If one’s purpose is to determine how much science a student considering a 
premedical course knows, his general competence is of more interest than 
his mastery of a particular segment of science. A few recent test batteries for 
high school and college have attempted to measure general educational
growth without regard to narrow subject-matter divisions. The tests of
General Educational Development (G.E.D.) of the U.S. Armed Forces Institute
are an example. These have been used widely to decide which veterans have 
learned enough while out of school to receive a high-school diploma, even 
though they never “completed” high school. The test battery measures math- 
ematical ability and English expression by more or less conventional tests, 
but its measures in science, social studies, and literature are novel. Instead 
of testing what facts a man knows in science, or what works of literature he 
happens to have encountered, these are “tests of interpretation” in the three 
fields. Passages similar to those in college science texts are presented and 
questions are asked to determine how well the man comprehends them.
Similarly, he is required to interpret social science materials and passages
from literature. Such a test draws on knowledge, but does not require a man
to remember specific detailed facts. Since the test resembles the task of
college work, it should be a good predictor of success in college (6).

In contrast to survey tests which estimate the person’s general attainment
in an area, some diagnostic tests have been prepared, notably for reading and
mathematics. A diagnostic test attempts to determine, not the level attained,
but the specific strengths and weaknesses of the student. Diagnostic tests
focus on the process by which the subject responds, rather than the product.
Diagnostic procedures in reading will be described below. Diagnostic
procedures are sometimes used in any teaching, whether of music, piloting a
plane, winding armatures, or clinical psychology. The teacher cannot teach
effectively unless he knows just what is wrong with the student. But, since
the intra-individual comparison is stressed rather than comparison with
others, diagnostic methods have rarely been standardized.

Most published tests have measured knowledge, neglecting other
behaviors. Among the most promising developments in achievement testing
are several new measures of other objectives, notably ability to apply prin- 
ciples, reasoning skills, and ability to interpret data. In addition to these, 
some schools have used interest tests and attitude tests (Chaps. 15, 16) to 
study pupil progress. 

Tests of ability to apply principles are unique in that they ask the pupil 
not to state or recognize a fact, but to solve a new problem where that 
principle is relevant. If a pupil can determine the correct solution to a “life” 
situation, not previously studied, and defend it with a sound scientific
principle, it is certain that he understands the principle. The most complex and
highly developed tests of ability to apply principles in science and social 
studies are those of the Progressive Education Association, made under 
Tyler’s direction (26). Examples, related to most school subjects, are pre- 
sented in The measurement of understanding (16). Few standardized tests 
have employed this technique as yet, but the following illustrations are 
taken from instruments developed by the Cooperative Test Service: 

“The walls of a living room which is to have indirect lighting should be
finished in

“1. a glossy, rather dark paint 

“2. a rough, rather dark wallpaper 

“3. a rough, brightly colored wallpaper 

“4. a shiny white paint 

“5. a rough, cream-colored stipple or wallpaper”¹

“Experiment. Coal gas which has not been previously mixed with air is burned
at a gas jet (no. 1). At a similar jet (no. 2) the coal gas is mixed with air before it is
burned,

“Probable results (Mark those which explain the result. . . .

“1. Incomplete combustion leaves some uncombined carbon in the flame.

“2. The presence of nitrogen retards combustion.

“3. Particles of uncombined carbon glow when heated. . . .”²

¹ From the Cooperative General Science Test, Revised Series, Form Q.

The essential point in tests of this type is that the question must present a 
problem unfamiliar to the student, without telling him what principle to use. 

Tests of reasoning ability are used in intelligence tests, but certain types
of reasoning are directly taught in school and can be tested as achievements.
Particularly in courses stressing propaganda analysis, scientific thinking, or
logical proof, learning can only be measured by presenting the pupil with
arguments to be analyzed. The Nature of Proof tests of the Progressive
Education Association are an example (26, pp. 126-154). A typical problem
describes a news report regarding deaths from typhoid in a town where
water was polluted. A conclusion is suggested that the sickness of one man
living on the stream above the town caused the widespread illness. Students
are then given many reasons (taken from previous student papers) which
they are to judge as supporting the conclusion, contradicting the conclusion,
or irrelevant. “Good doctors should be available when an epidemic hits a
small town” is irrelevant. “Typhoid fever germs are active after being carried
for about half a mile in clear running water” is necessary in proving the
conclusion. After studying and marking the reasons, the student decides
whether the conclusion was proved. A number of separate scores are obtained
from the complete set of several problems. Scores include “number of accurate
conclusions,” “number of irrelevant reasons called supporting,” and so on.
This is consistent with the belief that reasoning ability is not a unit, but
includes many different behaviors. Reliability coefficients for the Nature of
Proof scores range from .82 down to .20 (26, p. 517), which indicates that
the present form of the test is not adequate for diagnosis of logical thinking.
It can be a useful teaching instrument, since it focuses attention of pupil and
teacher on an important type of behavior.
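
How several separate scores can be drawn from one set of marked reasons may be sketched as follows. This is not the PEA scoring procedure, which the text does not give in detail. The function, the category labels, the reason identifiers, and the tally names other than “irrelevant reasons called supporting” are all hypothetical.

```python
def score_reasons(key, marks):
    """Tally a pupil's judgments of reasons against a scoring key.

    `key` and `marks` map a reason identifier to one of "supporting",
    "contradicting", or "irrelevant". The tallies kept here are
    illustrative, not the published set of scores.
    """
    tallies = {
        "judged_correctly": 0,
        "irrelevant_called_supporting": 0,
        "other_errors": 0,
    }
    for reason_id, correct in key.items():
        given = marks.get(reason_id)
        if given == correct:
            tallies["judged_correctly"] += 1
        elif correct == "irrelevant" and given == "supporting":
            tallies["irrelevant_called_supporting"] += 1
        else:
            tallies["other_errors"] += 1
    return tallies


# Hypothetical key and one pupil's marks for three reasons.
key = {"r1": "irrelevant", "r2": "supporting", "r3": "contradicting"}
marks = {"r1": "supporting", "r2": "supporting", "r3": "irrelevant"}
print(score_reasons(key, marks))
```

Each tally rests on only a few items, which is one plausible reason why some of the separate scores show low reliability.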

The third important new test is Interpretation of Data, a PEA instrument.³
Exercises require the student to make inferences and interpretations from 
tables, charts, maps, and other forms of presentation. Data deal with a wide 
range of interesting and unfamiliar situations, from baseball averages to 
diet. Students are given numerous suggested interpretations, which they rate 
as “true,” “probably true,” etc. 

13. If admission to college depends in part on a test which stresses knowledge of 
historical facts, what sort of course could a teacher give high-school seniors 
which would improve their chances of passing? What sort of social studies 
course could he give if admission is based on the G.E.D. tests? 

14. The G.E.D. tests require up to two hours to obtain a single score, whereas the
Progressive Achievement battery obtains about 20 scores in two hours. Are
these equally good measurement practices?

15. Discuss the argument: “The USAFI tests of interpretation are measures of
intelligence and reading ability rather than educational development.”

16. Would the Nature of Proof test be valid as a measure of intelligence of
adolescents?

² From the Cooperative Chemistry Test for College Students, Form 1940.

³ Now published by the Educational Testing Service.


Table 56. Representative Achievement Test Batteries

Test and Publisher: Cooperative Tests (Educ. Testing Service)
Grade Level: High school; college
Content: English, lit. comprehension, lit. acquaintance, social studies,
nat. sciences, math., contemporary affairs
Comments of Reviewersᵃ: Eng.: “Excellent for survey. . . . Not individually
diagnostic” (Leonard, 5). Lit. comp.: “Question whether test measures
anything of consequence” (Roberts, 5).

Test and Publisher: Metropolitan Achievement Tests (World Book)
Grade Level: Grades 1; 2-3; 3-4; 4-6; 7-8
Content: Reading, vocab., arith. fundamentals, history, spelling, etc.
Comments of Reviewersᵃ: “Diagnostic features excellent. . . . Conform with
the best practices in test construction and standardization” (Witty, 5).

Test and Publisher: Progressive Achievement Tests (Calif. Test Bureau)
Grade Level: Grades 1-3; 4-6; 7-9; 9-13
Content: Vocab., reading, math. reasoning, math. fundamentals, language
Comments of Reviewersᵃ: “Among the best of their type. . . . Scarcely fulfill
all the claims made for them” (Odell, 4, p. 38).

Test and Publisher: Stanford Achievement Tests (World Book)
Grade Level: Grades 2-3; 4-6; 7-9
Content: Reading, language, arith., soc. science, spelling, etc.
Comments of Reviewersᵃ: “[Certain portions] not surpassed as tests of skill
subjects. . . . Subtests covering content fields not abreast of curricular
changes” (Preston, 5).

Test and Publisher: Iowa Tests of Educational Development (Sci. Res. Assoc.)
Grade Level: High school
Content: Social concepts, background in science, writing, quantitative
thinking, interpretation of reading materials, etc.
Comments of Reviewersᵃ: “Probably best battery available for use in senior
high school” (Chauncey, 5). “Intended to measure relatively permanent
changes in broad aspects rather than specifics” (Hanna, 5).

Test and Publisher: U.S. Armed Forces Institute Tests of General Educational
Development (Educ. Testing Service)
Grade Level: High school; college
Content: Mathematics, expression, interpretation of reading in soc. studies,
etc.
Comments of Reviewersᵃ: “Quality of items high. . . . Absence of validity
data deplorable. . . . One chief question is whether tests measure academic
achievement independent of general intelligence or general reading ability”
(Conrad, 5).

ᵃ The reader should refer to the source cited for a full statement of the reviewer’s opinion.


READING TESTS 

Reading deserves special attention in a book of this type for two reasons. 
First, reading tests have been developed in greater number and variety than
any other type of achievement test and demonstrate numerous problems in
test construction. Second, they are probably used more widely in guidance
and clinical practice than other achievement tests.

Definition of Abilities

At a glance, reading seems like a clearly defined skill which could be 
readily measured. Actually, no area illustrates more clearly than reading 
that tests having the same name measure quite different behaviors. Different 
authors disagree on what is included in the area and on the most useful 
definition of rate, comprehension, word knowledge, etc. for testing purposes. 
One author examined 24 reading tests and found that between them they 
measured 48 different skills (31). This does not mean that reading is a
complex of so many specific abilities, however. One test claimed to measure
several “entirely different” reading skills, but the correlation of scores showed
that these different abilities were actually the same function measured over
and over under different names (14). One way of clarifying which reading
abilities are truly different is factor analysis. In 25 tests of reading and study
skills, the following common factors were found: tendency to read carefully
(an attitude or response set), inductive reasoning, rate of reading, verbal
ability, vocabulary, rate for unrelated facts, and chart reading (14). In view
of the variety of test content the person who needs a test of reading must be
careful to define what reading ability he wishes to measure, and must turn to
dependable sources (4, 5, 20, 31) for a trustworthy analysis of possible tests.
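
The check on whether two subtests with different names measure the same function rests on the correlation of their scores. The sketch below shows the computation of the ordinary product-moment correlation. The subtest names and scores are hypothetical, and a full factor analysis of the kind cited would require many tests and more elaborate methods.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores of eight pupils on two differently named subtests.
noting_details = [12, 15, 9, 20, 17, 11, 14, 18]
grasping_ideas = [13, 16, 10, 19, 18, 10, 15, 17]

print(pearson_r(noting_details, grasping_ideas))
```

A correlation near the reliability of the subtests themselves would suggest that the two names cover a single function; a low correlation would suggest distinct abilities.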

Survey Tests 

Survey tests are ordinarily intended to represent, in one to three scores,
the subject’s general level of reading development. They are used to screen
people who could be helped by remedial teaching, to predict success in
courses, and to check whether poor reading explains a subject’s poor
performance on a group mental test.

Reading development includes both speed of reading and comprehension, 
and a useful test must consider both these elements. Most testers have tried
to develop tests which can measure independently the two aspects of
performance, and they have been largely unsuccessful. This problem occurs in
most testing, but rarely is it so easily recognized as in reading: When an act 
contains several integral aspects, one cannot divide the act into fragments 
for testing purposes. This principle has been illustrated in Binet’s finding 
that significant intellectual performances could only be measured by com- 
plex tasks, and in the finding that complex coordination tasks resembling 
a job are often better predictors than any combination of elemental dexter- 
ities. 

In theory, the way to separate speed and comprehension is to hold one
constant while the other is measured. Speed can be minimized by giving a
test without a time limit: the subject’s understanding of what he reads should
be a pure measure of comprehension. Rate is much harder to measure. Every
person has a variety of reading rates, changing with his purpose and the 
material read. Good readers tend to shift their speed as difficulty increases
whereas poor comprehenders seem to use a rather unvarying rate (1). The
problem for the tester is: How can we have every student read the same
passage with the same degree of concentration and the same effort to
comprehend? Rate of reading can be raised so easily by becoming superficial
that the final result is ambiguous unless effort to comprehend can be
controlled. The usual technique to control comprehension is to require the
subject to answer questions about what he has read, or to cross out
absurdities as he reads. The former technique is employed in the Iowa Silent
Reading Test and the Nelson-Denny test. The Nelson-Denny method is
illustrated in Figure 69. The score is the number of questions answered
correctly in a set time.


The government of Henry the Seventh, of his
son, and of his grandchildren was, on the whole,
more arbitrary than that of the Plantagenets.
Personal character may in some degree explain the
difference; for courage and force of will were
common to all the men and women of the House of
Tudor. They exercised their power during a
period of one hundred and twenty years, always with
vigour, often with violence, sometimes with
cruelty. They occasionally invaded the rights of the
subject, occasionally exacted taxes under the name
of loans and gifts, and occasionally dispensed with
penal statutes: nay, though they never presumed
to enact any permanent law by their own
authority, they occasionally took upon themselves, when
Parliament was not sitting, to meet temporary
exigencies by temporary edicts. It was, however,
impossible for the Tudors to carry oppression beyond
a certain point, for they had no armed force, and
they were surrounded by armed people. Their
palace was guarded by a few domestics, whom the
array of a single shire, or of a single ward of
London, could with ease have overpowered. These
haughty princes were therefore under a restraint
stronger than any which mere law can impose.


1. With whom is the paragraph chiefly concerned?
1. The Tudor Kings. 2. The Plantagenet Kings.
3. The palace guards. 4. The London populace.
5. The English people.

2. How were new laws secured when Parliament
was not sitting? 1. They were made by a single
London ward. 2. Old laws were revived. 3. They
were made by the army. 4. The King issued an
edict. 5. They were made by the Princes' council.

3. Under what guise were taxes sometimes
collected? 1. Fines. 2. Loans. 3. Tariffs.
4. Commissions. 5. Sale of public offices.

4. What personal trait was always displayed by
the rulers of the dynasty discussed in the
paragraph? 1. Cowardice. 2. Deceit. 3. Vigour.
4. Vacillation. 5. Forbearance.


Fig. 69. Specimen item from Nelson-Denny Reading Test* 


In the Iowa test, the student reads first for a set period of time, which yields 
his rate score, and then turns the page and answers the questions, which 
give a comprehension score. The absurdities technique is represented by the 
following item from the Dvorak-van Wagenen test: 


Directions: Read each paragraph carefully, find the word in the last half that does
not fit in with the meaning of the rest of it, and make a cross through it. 

Mrs. Glenn has been buying a lot of new furniture for her house. Yesterday the 
furniture truck came with a new table, a new hat, and several new rocking chairs. 


* Nelson-Denny Reading Test for Colleges, Houghton Mifflin Company.

1. The scores on rate and comprehension are often interdependent, so that
the pupil can raise one at the expense of the other. When only a single “rate
of comprehension” score is obtained, the score is influenced by the subject’s
tendency to read cautiously, which will raise or lower his score relative to
others.

2. Tests supposed to measure comprehension are often heavily speeded, 
so that they are strongly influenced by rate of reading. Such tests are of little
value for diagnosing, although they are good predictors.

3. The reading test covers only a selected range of content, yet reading 
ability varies somewhat with different materials. Some people can read
history well but not science; some do well on stories, poorly on textbooks. The
survey test includes a sample of a broad or narrow type of content; the way 
the content is selected makes the test more useful for predicting some types 
of performance, less useful for others. 

4. Many tests measure only a limited type of comprehension. The skilled 
reader must not only be able to make sense of sentences, as in the cross-out
item illustrated above. He must also be able to take the main idea from a 
long passage, put together ideas in separate sentences, follow a logical argu- 
ment, and so on. Some reading tests measure only the simplest type of com- 
prehension, whereas others demand deep and thorough interpretation. A 
survey test cannot measure all types of comprehension separately, which
makes it important to use a test measuring the type of reading with which
one is concerned. 

17. Examine the questions used in the Nelson-Denny selection (Fig. 69), and 
decide what mental activities each calls for. Justify or criticize this meaning of 
comprehension in a survey test for college readers. 

18. As an evidence of validity, the Nelson-Denny manual reports a high correlation 
of the test with freshman marks. Does this demonstrate that the test is valid? 

19. Could a student attempt to answer the questions by reading the questions first
and looking up the answers, rather than first reading through the whole
paragraph? Would this procedure affect the score obtained?

20. For what predictive and guidance purposes would it be most desirable to have 
each of the following tests? 

a. A test of ability to read selections drawn from college textbooks in a variety 
of fields. 

b. A test of ability to read selections drawn from college textbooks in physics 
and mathematics. 


Diagnostic Methods 

A diagnostic achievement test at its best is an impressive tool. With or
without such tests all teachers must at times study their students to
determine why they are having difficulty. The diagnostic test tries to increase the
efficiency of diagnosis by bringing to the teacher’s aid all the insights
available from research on learning failures. An ideal diagnostic achievement test
calls the teacher’s attention to every aspect of the reading process wherein
the pupil might have stumbled. By checking off one at a time the many
sorts of possible error, the teacher is left with a picture of the specific
weaknesses that must be remedied before the pupil can make normal
progress.

Such a diagnostic procedure must be based on extensive research to deter- 
mine what types of errors are made. Once the errors are listed, it is necessary 
to devise test procedures to reveal clearly which errors the pupil makes.
Diagnostic methods have been standardized for arithmetic and a few other
school subjects, but they have reached their highest development in reading. 
Reading specialists have available a great variety of diagnostic techniques, 
and a few of these have been organized into batteries sufficiently simple for 
classroom teachers to use. The two most widely known methods of individual 
diagnosis are the Monroe-Sherman tests, which include both aptitude and 
achievement measures, and the Durrell Analysis of Reading Difficulty. The
latter will be described here. 

Durrell bases his tests on study of the reading errors made by 4000 school
children. His tests provide an opportunity to observe the child at work in
oral and silent reading, and in special tests. In this respect, they are like 
the observations of performance discussed later in the chapter. The first 
tests deal with oral reading. The tester notes the time required to read the 
standardized paragraphs, and notes whatever errors occur. Silent reading is 
then checked on a set of paragraphs of equal difficulty to the oral series. 
Questions are used to check recall, and the teacher observes such reading 
habits as visible lip movement. A flash-exposure device is used to show words 
briefly; this indicates perceptual habits and errors. Finally, there is a phonetic 
inventory for children who have difficulty in word perception. The analysis
is not a mechanical device; it calls for great skill and keenness of
observation by the tester. In the oral tests, the tester must record phrase reading,
hesitation on words, mispronunciation, omission of words or syllables, ignor- 
ing of punctuation, and enunciation. The virtue of the test is that it presents 
materials of standardized difficulty, and that the check list of errors calls the
tester’s attention to all the significant facts that may be noted. 

The type of information that comes from a careful diagnosis is illustrated 
by Durrell’s report on Anthony, age 9-8, in the fourth grade. His Binet MA
was 9-4, and his IQ was 97, but his general reading achievement was at the 
low second-grade level. 

“On the Durrell Analysis of Reading Difficulty, Anthony made a low
second-grade score on oral-reading tests, but seemed quite unable to keep his attention on
silent reading. He did poorly on quick perception of words, and had no method of
word analysis. He read a word at a time in a strained voice and a monotone. He was
markedly insecure in his reading and repeated words continually. He was 
unaware of the errors in his reading, indicating a lack of concern about meaning. 
When his errors were corrected in his oral reading, his comprehension was ex- 
cellent. 

“The silent reading was marked by a high rate at the expense of mastery. He 
skipped all the hard words; as a result his recall was scanty and inaccurate,
although he did the best he could with it. Strictly speaking, he did not read silently
at all, since his reading was accompanied by constant whispering of the words, 
vague sounds being given for the difficult words. His eye movements in silent read- 
ing were irregular and unrhythmic, with seven to ten per line and many regressive 
movements” (10, p. 333) . 

Another group of diagnostic tests is much simpler, being intended for 
group administration. These contain subtests presumed to measure various 
types of reading ability. Performance is represented as a profile showing the 
relative strengths and weaknesses of the pupil. The Iowa Silent Reading Test
is among the most prominent of these tests. It reports nine scores, based on
these subtests: 

Rate of reading. The pupil reads two short articles, marking his place when
a signal is given. He then answers simple questions on the passages. Both 
rate and accuracy scores are reported. 

Directed reading. The pupil answers questions about paragraphs by indicating
the place in the paragraph where each question is answered. This
and following tests are separately timed, with most pupils having insufficient
time to finish.

Poetry. The pupil reads poetic selections and answers questions about their 
meaning. 

Word meaning. Four lists of words are presented, in social science, science,
mathematics, and English. With each word are presented five alternatives,
the pupil choosing the best synonym. This test is conducted in four 90-second
sections, the pupil going from one list of words to the next when
time is called. One score is obtained on the section.

Sentence meaning. The pupil reads sentences dealing with the meaning of
words and simple information, marking each true or false. 

Paragraph meaning. The pupil reads paragraphs and answers questions 
about each. Items call for both direct memory and recognition of central 
thought. 

Use of index. An abbreviated index is presented, and the pupil reports which 
pages of the book would contain answers to certain questions. 

Key words. The pupil indicates what words he would seek in an index in 
order to obtain information about such questions as, “What is the value of 
our annual supply of mineral products?” 

21. Reliabilities of the subtests for the Iowa test are not reported except by split- 
half methods which are incorrect for speed tests. What possible fallacy is in- 
volved in interpreting a profile without knowing the reliability of subtests? 

22. In what ways would the test be more or less satisfactory if all parts except 
“Rate of reading” were given with no time limit? 

23. Since the entire Iowa test requires more than one hour to give, it is sometimes 
desired to omit certain sections. Which parts would be most valuable to help 
plan remedial training for a college student having difficulty due to slow 
reading?

Table 57 lists a small sampling of the great variety of reading tests on the
market. Not only do the tests vary in design, but the authors of tests have 
widely differing opinions as to the most useful subdivisions of the reading 
act for testing purposes. McCullough, Strang, and Traxler list 17 questions 
that need to be answered in choosing a group reading test for school use, 
and make it clear that several tests do not measure up to these criteria (20,
pp. 133 ff.). They go on to say, “There has yet to be constructed an entirely
satisfactory silent-reading test, and it is quite possible that when such a test 
does appear, no one will buy it except for use in clinical situations. A truly 
adequate test would have to be too long, with too many parts to score and 
too many kinds of material, for any school to find time to use it on a school- 
wide basis.”” 

® This statement was written before the appearance of the Diagnostic Reading Tests,
on which two of these authors collaborated.


Table 57. Representative Reading Tests

Cooperative Reading Comprehension (Educ. Testing Service)
    Grade level: 7-12, college
    Time: 40 min.
    Scores reported: Vocabulary, speed of comprehension, level of comprehension
    Comments of reviewers [a]: “The level of comprehension score is not complicated
        by rate except for certain students. . . . Advantages of such a score in
        clinical work, instruction, and research are obvious” (Stroud, 5).

Diagnostic Reading Tests (Educ. Rec. Bureau)
    Grade level: 7-13
    Time: For survey, 40 min.
    Scores reported: Survey: rate, comprehension, vocabulary. Diagnostic:
        vocabulary, in five areas; comprehension, silent and auditory; rate for
        stories, soc. studies, science; word attack
    Comments of reviewers [a]: —

Durrell Analysis of Reading Difficulty (World Book)
    Grade level: 1-6
    Time: About 50 min.
    Scores reported: Tester notes speed and specific errors in oral and silent
        reading, flash perception, phonetics
    Comments of reviewers [a]: “No reliability coefficients. . . . Little concerning
        validity is cited. . . . Author correctly emphasizes usefulness of test for
        standard observation of errors” (Tinker, 4, p. 341).

Dvorak-van Wagenen Diagnostic Examination of Silent Reading Abilities
(Educ. Test. Bureau)
    Grade level: 4-6; 7-9; 10-college
    Time: About 2 hrs.
    Scores reported: Rate of comprehension; ability to perceive relationships;
        vocabulary, in context and isolated; six other scores
    Comments of reviewers [a]: “Reliability of some of the parts low and
        intercorrelations high” (20, p. 132).

Gates Reading Tests (Teachers Coll., Columbia Univ.)
    Grade level: Various [b]
    Time: 35-60 min.
    Scores reported: In survey test for grades 3-10: level of comprehension,
        vocabulary, speed
    Comments of reviewers [a]: Primary: “Test exercises reasonably valid. . . . Can
        be used for general survey or diagnostic purposes” (Gray, 5). Survey: “Most
        valuable survey-type reading test at present time. . . . Where further
        diagnosis is desired, may be supplemented” (Holberg, 5).

Gray Oral Reading Paragraphs (Pub. Sch. Publ. Co.)
    Grade level: 1-8
    Time: About 10 min.
    Scores reported: Tester notes speed and specific errors made
    Comments of reviewers [a]: “Observations may have considerable diagnostic
        value” (Kopel, 4, p. 368).

Iowa Silent Reading, new edition (World Book)
    Grade level: 4-9; 9-13
    Time: 49 min.
    Scores reported: Rate, comprehension, word meaning, paragraph meaning, five
        other scores
    Comments of reviewers [a]: “Primarily a survey test. . . . Rate of work affects
        scores on all parts. . . . Merits attention and study” (Booker, 4,
        pp. 347 ff.). “Appears to reward pupil who reads superficially and parrots
        glibly” (Davis, 5).

Monroe-Sherman Group Diagnostic Aptitude and Achievement (C. H. Nevins Co.)
    Grade level: Age 8-15
    Time: —
    Scores reported: Paragraph meaning, speed of reading, word discrimination;
        seven scores on visual, auditory, motor aptitudes
    Comments of reviewers [a]: —

Nelson-Denny (Houghton Mifflin)
    Grade level: High school, college
    Time: 30 min.
    Scores reported: Vocabulary, paragraph meaning
    Comments of reviewers [a]: —

Sangren-Woody (World Book)
    Grade level: 4-8
    Time: 27 min.
    Scores reported: Word meaning, rate, fact material, total meaning, central
        thought, following directions, organization
    Comments of reviewers [a]: “Admirably suited both for diagnostic purposes and
        comprehensive survey” (Liveright, 4, p. 364).

[a] The reader should refer to the source cited for a full statement of the reviewer’s opinion.
[b] Gates has produced a great variety of tests adapted to specific purposes and particular grade levels.


24. If survey tests give an inadequate description of reading difficulties, and if
diagnostic tests are too complex for widespread use, how can reading tests be
used most effectively in the school?

25. In what respects is a group test of the Iowa type inferior to an individual test for
diagnostic purposes? 

26. The following paragraph is from the manual for the SRA Reading Record: 

“Reliabilities were computed on the record for one thousand ninth- and 
eleventh-grade students. The median reliability for these two homogeneous 
groups is: Test 1, .79; Test 2, .75; Test 3, .86; Test 4, .96; Test 5, .86; Test 
6, .83; Test 7, .96; Test 8, .79; Test 9, .78; Test 10, .75; and Total, .93. In 
terms of these correlations, the scores on the record are reliable enough for
individual as well as group diagnosis. Individual students may be guided
into advanced reading or placed in remedial classes with the assurance that
such decisions are based on accurate facts about their reading skills.”

To interpret the above facts one also needs to know that each subtest is
timed, and that the separate tests presume to measure separate aspects of 
reading, even though their correlations are “moderate to high.” Reliabilities 
were determined by the split-half method. What facts about reliability have 
been ignored, in judging this test to be accurate enough for individual di- 
agnosis? 
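
One consideration in judging a profile can be sketched numerically. When two subtest scores are compared, the reliability of their difference is lower than the reliability of either score, and it falls further as the subtests correlate more highly. The sketch below uses the standard formula for the reliability of a difference between two scores of equal variance. The intercorrelation of .60 is an assumed value (the manual says only “moderate to high”), and the function name is illustrative.

```python
def difference_reliability(r_aa, r_bb, r_ab):
    """Reliability of the difference between two scores with equal variances."""
    if r_ab >= 1.0:
        raise ValueError("intercorrelation must be below 1.0")
    return ((r_aa + r_bb) / 2.0 - r_ab) / (1.0 - r_ab)


# Reliabilities quoted in the manual for Test 1 and Test 2.
r_test1, r_test2 = 0.79, 0.75
# Assumed intercorrelation; the manual reports only "moderate to high".
assumed_r12 = 0.60

r_diff = difference_reliability(r_test1, r_test2, assumed_r12)
print(f"Reliability of the Test 1 minus Test 2 difference: {r_diff:.3f}")
```

Under these assumptions the difference between two subtests has a reliability a little above .40, far below that of either subtest alone. Since split-half coefficients overstate the reliability of timed tests, the true figure would presumably be lower still.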

TESTS OF SKILLS IN PERFORMANCE 

It is important in vocational courses and in employment to measure skills.
All courses teach skills (reading, interpretation of data, etc.), but skills are
particularly central in typewriting, comptometer operation, shopwork, blueprint
reading, dressmaking, music, and so on. Measurement of maximum
ability to perform is based on the principle of the worksample. One either
obtains a sample of the subject’s product and rates it, or observes and judges
his performance as he works.

Product Rating 

For product rating, we must compare similar specimens of the best work
of each subject. Sampling is a difficulty, since a single object may not represent
the learner fairly. To compare several people, it is desirable to have
them work on similar material. One standard test in stenography accomplishes
this with a recorded dictation which the subject must take down and
transcribe. This method holds constant not only the difficulty of material,
but also the speed of dictation and clarity of speech. McPherson standardized
a test in simple woodworking by requiring each boy to construct a wood
block like a model. The block was designed to demand use of saw, drill, and
chisel. Scoring was done objectively by imposing a plastic pattern on the
block to check dimensions (21).


                      1                                2    3                                 Score

Appearance            1. Shriveled                          Plump and slightly moist          1.
Color                 2. Pale or burned                     Well browned                      2.
Moisture Content      3. Dry                                Juicy                             3.
Tenderness            4. Tough                              Easily cut or pierced with fork   4.
Taste and Flavor      5. Flat or too highly seasoned        Well seasoned                     5.
                      6. Raw, tasteless, or burned          Flavor developed                  6.

Fig. 70. Score card for rating sample of cooking.*


From G. T. Buswell, SRA Reading Record, Examiner Manual, p. 3. Reproduced by
permission of Science Research Associates. Copyright 1947, by Science Research
Associates.

Many of the methods described can also be applied to the study of typical behavior,
if test motivation is absent. Methods discussed in Chapters 18 and 19 deal primarily
with typical performances.

* Clara M. Brown and others. Food Score Cards, University of Minnesota Press,

Objectivity in scoring is aided by a check list or rating scale. This forces 
all judges to notice the same features of the sample and to use a comparable 
numerical scale. Two product rating forms are illustrated in Figures 16 
and 70. 


Tests of Performance 

Tests of performance, or worksamples, are needed for skills where a 
sample of the product is not an adequate index of ability. The standard 
typing test in civil service practice is such a test, indicating both speed and 
quality of performance. Some tests may use regular factory or shop equipment,
while others use special apparatus. Use of normal equipment is illustrated
by a test of ability of packers in a cannery. Production on the job
could not be used as a test, because of variable factors from day to day and
because the job was normally affected by teamwork. For test purposes, one
conveyor belt was set aside and one worker at a time assigned to it. A count
was made of the number of cans he packed per hour (27, p. 86).


[Figure: check-list form with entries for Control Use (Simultaneous, Successive),
Precision, Bank (Constant, Varies), Slips, Skids, A. Climbing speed (MPH), and
B. Correct climbing speed.]

CLIMBING TURN

This is a maneuver that calls for proper coordination of all controls.

A. SPEED is the most common speed shown by the indicator during the turn.

B. The “Correct climbing-speed reading” during the turn is determined from the
air-speed indicator in the same way as the climbing speed in a straight climb.

Fig. 71. Portion of Ohio State Flight Inventory, used for recording
performance in a climbing turn (23).

In worksamples, as in some psychomotor tests, it must be recognized that 
a score represents only maximum ability at the present point in a learning
curve. A score earned by one worker on a machine may be the highest he will
ever attain, since he has had much training; another worker, with a lower
numerical score, may in the long run be a better worker, in view of his present
lesser training or experience.

Special equipment is used to obtain a worksample where regular equip- 
ment cannot be used, because of either cost or danger. It is essential for 
every submarine crewman to learn to use the escape hatch of his ship, in
case it should sink in shallow water. The only sure test of ability is to have
him try to use it, but it is obviously impossible to make the test at sea. To
test (and to train) crewmen on shore, a deep tank is used in which a replica 
of the escape hatch is built. Since this is in all essential features like sea 
conditions, a valid worksample can be obtained to determine when a man has 
attained an adequate skill. 

Motion pictures have occasionally been tried as a method of obtaining 
worksamples of ability to observe. Ability of aerial navigators was tested by
showing them a motion picture, taken from a plane, which showed a view of
the ground and of the essential instruments. Aided by a map, students were
to make a plot such as they would in flight. Like many worksample tests, this
proved to have little reliability and was not adopted (7). Low reliability is
characteristic of worksamples where one error may disturb the sequence
of performance. All worksample tests must represent normal conditions
of work and a large enough sample of performance must be obtained.
Motion-picture techniques were successful in obtaining reliable measures of
many skills important in the Army Air Force. These successful tests, however,
usually included a large number of short, similar items, rather than a
few complex sequences of performance (12).
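
The advantage of many short, similar items over a few complex sequences can be illustrated with the Spearman-Brown formula. The text does not cite this formula; it is used here as a rough sketch, and it assumes that the added items are parallel to the original one. The starting reliability of .30 for a single item is a made-up figure chosen only for illustration.

```python
def spearman_brown(r_single, k):
    """Predicted reliability when a test is lengthened k times with parallel parts."""
    if k <= 0:
        raise ValueError("k must be positive")
    return k * r_single / (1.0 + (k - 1.0) * r_single)


# Hypothetical reliability of one short item; not a figure from the studies cited.
assumed_r = 0.30

for n_items in (1, 5, 10, 20):
    predicted = spearman_brown(assumed_r, n_items)
    print(f"{n_items:2d} similar items: predicted reliability {predicted:.2f}")
```

On these assumptions, reliability rises steadily as similar items are added. A test built from one or two long sequences has, in effect, very few independent items and gains little of this benefit.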

Evaluation of performance is aided when an objective record is made of
what the subject does. Check lists or rating scales are helpful, especially to
record how the man performs, his style and methods of work, and what
errors he makes. Such a record permits concentration on weaknesses in later
learning. An example of such a check list is the Ohio State Flight Inventory,
designed to aid in research on pilot training. While the pilot followed a
prescribed maneuver, the instructor riding with him checked his procedures on
a form such as that shown in Figure 71. In all, 640 items were to be checked
(23).

At times a performance is too rapid or too subtle for an observer to identify
the correct and incorrect actions. Psychologists have found mechanical recording
devices helpful for measuring and teaching such skills. An example
from industrial training is provided in Lindahl’s study of disk-cutter operation.
The difference between good and poor performers was found to lie in the
speed with which they went through each phase of the cycle of operation.
The operation called for pressing a pedal to drive the cutting wheel, and
releasing it for a new cut. Lindahl devised the recording device shown in
Figure 72, which yielded records such as that in Figure 73. These objective
records showed not only which workers were the best producers, which could
have been judged by the quality of the product, but also exactly what errors
each was making. The records also provided a means of teaching the worker
what errors he was making and helping him recognize the “feel” of the pedal
when he was doing the act correctly.


Fig. 73. Improvement in the foot-action pattern of a trainee (19). The record
at the top shows long pauses between strokes, uneven speed during the cutting
(downstroke), and jerky foot action at the end of the stroke. All these faults
were eliminated in the final record.


Table 58. Achievement and Aptitude Tests Used in R. H. Macy & Company, Inc. [a]

Category                Test                              Remarks

General ability         Wonderlic adaptation of Otis      For non-executives
                          Self-Administering
                        Mental ability                    Difficult one-hour test for
                                                            prospective executives

Arithmetic              Fractional arithmetic             Problems related to salesclerk's
                                                            duties
                        CMC arithmetic                    More difficult test of fundamentals
                                                            for merchandise control clerks
                        Arithmetic reasoning              For supervisors and senior positions
                                                            in merchandise control
                        Mental arithmetic                 For tellers in Depositor's Account
                                                            Division

Merchandise knowledge   Wine and liquor                   For salespersons in liquor department

Office skills           Thurstone typing                  Rough-draft typing and tabulation
                        Bureau of Adjustments typist      For workers filling in small forms
                        Blackstone typing                 Simple copying for junior typists
                        Stenography
                        Comptometer adding                Adding long columns on comptometer
                        Filing                            Alphabetical filing test for
                                                            experienced filers

Language ability        Spelling                          For typist correspondents and
                                                            secretaries
                        Correspondent's test              Worksample; subject writes letters
                                                            answering inquiries from customers,
                                                            following directions indicating
                                                            proper content for letter. Graded
                                                            on composition and grammar.
                        Speed and accuracy;               Tests of clerical perception and
                          same-different; comparison        accuracy for various jobs
                          of signatures and printed
                          names

Manual ability          Minnesota Rate of Manipulation    For kitchen employees

Vision                  Ishihara color vision             For guards and stockmen handling
                                                            colored merchandise (e.g., men's
                                                            ties)

[a] Information supplied by Patricia Scharf, of Macy's Personnel Division. All tests except the
Wonderlic, Thurstone, Blackstone, Minnesota, and Ishihara were specially constructed by Macy's.


27. Outline a plan for obtaining product ratings and performance observations for
each of the following situations. In each case, discuss the relative merit of the
two procedures.

a. Testing a boy’s knowledge of how to wire batteries in series.

b. Testing the improvement in technique of a concert violinist. 

c. Testing ability to operate a calculating machine for all types of operation. 

ACHIEVEMENT TESTS IN EMPLOYMENT PRACTICE 

Achievement tests for hiring employees are quite varied, since each one 
must deal with the skills or knowledges of a particular job. Except in the
clerical field, standard tests are rarely used. So long as the requirements of
jobs in several places are similar, there is no reason why achievement tests
for specific jobs should not be standardized. A few such tests have made
their appearance, an example being the series of Purdue Vocational Tests,
dealing with machine shop, electricity, and industrial mathematics. The tests
are more precise and more difficult than simple oral trade tests.

The wide range of achievement tests usable in hiring and promotion is 
illustrated by the list of tests used by one organization, R. H. Macy & Com- 
pany. The tests used in that store are summarized in Table 58. Noteworthy 
are the variety of tests and the adaptation of the tests to the specific demands 
of jobs. Such tests can be used for placing an applicant in the most suitable 
job, which is an advance over the policy of hiring “typists” and placing them 
at random in clerical jobs without further testing of specific abilities.

28. Classify the tests in Table 58 as to whether they measure aptitude or
achievement.

29. Some clerical aptitude tests include a vocabulary section. What advantages 
would there be in measuring “office vocabulary” rather than vocabulary in
general? Which of the following words would be most suitable for an “office
vocabulary” test: beneficial, indemnity, allegro, lament, kingly, itinerary? (All 
words from SRA Clerical Aptitudes Test.) 

ACHIEVEMENT TESTS IN THE SCHOOL PROGRAM 

In view of the dangers found in indiscriminate use of standard achieve- 
ment tests, it is important to formulate a policy to guide the sound use of 
tests in schools. In recommending the use of standard tests, there are four 
questions to be answered: When should they be used? What should they 
test? Which tests should be chosen? How should the results be used? Of
these, the fourth is paramount. 

The proper function of a test in school is to improve the educational
program; it may do so by helping plan what learning experiences a pupil needs,
by indicating ways in which teaching can be improved, or by building atti- 
tudes in pupils and teachers which will promote better teaching. McConn 
reviews and discards all proposed purposes of testing in schools save one, 
guidance of the pupil (15, pp. 443-478). Guidance is conceived, however, 
not in narrow vocational terms, but as the problem of finding out all about 
the pupil so that his growth can be promoted. 

Once this point of view is accepted, several corollaries follow. If tests are 
for guidance, they are initial, not terminal, parts of the educative process. 
There is little merit in testing after it is too late to profit from the results. For 
this reason, more and more schools are using achievement survey tests at the 
beginning of the school year. When suitable tests are given and the results
placed in the hands of every teacher, they provide a sound basis for planning 
the year’s work. Under this point of view, there is no argument against testing
again in June to determine improvement, but in fall testing the emphasis
is placed on diagnosis and guidance rather than marking and recrimination.

If tests are used for guidance, they are used for the pupil rather than on 
him. They become a means for him to assess his weaknesses, and are a more 
effective argument for his taking certain courses or changing certain habits 
than pressure from the teacher. In such testing, it is important to minimize
elements of marking, competition, and imposition from above. In the most 
successful programs, the pupil takes the tests because he wants to know the 
results. 

It then follows that the tests to be used have to measure something of 
importance. Some schools will seek to measure acquisition of subject matter. 
Others will be more concerned with general educational development,
defined less in terms of knowledge than in terms of skills such as interpretative
reading and interpretation of data. Probably there are few objectives so 
widespread that all pupils should be tested on them at the high-school level.
This means that in addition to every-pupil tests, tests will be needed in 
specific subjects, such as foreign language, and in vocational skills. Probably 
the tests to be used in school-wide testing will deal with the objectives which 
are the major concerns of the school: command of the English language, de- 
sirable citizenship attitudes, critical thinking, and the like. 

Increasing attention should be paid to specific performances — success on 
particular items, or particular subskills in complex tests. Attention to what 
is missed, rather than how much, is a step toward eliminating the error. 

Interpretation is likely to be adequate only when tests are comparable. 
Comparable tests measure growth from year to year, and compare the
pupil’s development in various types of behavior. For this reason, series 
of tests such as those of the Cooperative Test Service or the Iowa Tests of 
Educational Development are more adequate than tests of single objectives, 
independently standardized. At the elementary level, achievement batteries 
have comparability from score to score, but often use unsound grade norms.

Use of achievement tests has been unhealthy in the past when they have 
forced schools into training rather than educating pupils. So long as tests 
are considered in the light of the pupil’s past development and as a guide to
his future progress, they need have no harmful results. In many respects,
they will have to be modified to meet these new demands adequately. Tests
of limited validity may serve tolerably as impartial marking instruments. 
But when a test bears the responsibility of describing what a pupil knows
and can do, and what he needs to attain, it will have to meet a high standard 
of validity. It is in this direction that improvement is to be anticipated.

CHAPTER 13


Problems and Practices

THE MEANING OF “TYPICAL BEHAVIOR”

To solve many of their most vital problems, psychologists must draw conclusions
about the typical behavior of individuals. Abilities and capacities define
limits of performance, but what one actually does is rarely motivated to the 
point where he uses his utmost quantity or quality of performance. One 
characteristic, for example, which cannot be defined in terms of ability is 
character; nearly everyone exposed to a culture for a reasonable period 
knows what is the approved course of action. In order to study character 
formation or to use character as a guide in employment, we must in some 
way find out whether the individual habitually, characteristically, typically 
chooses the “good” action. 

“Typical behavior” is an elusive concept, because behavior in a given situation
is not the same on all occasions. If the worker who works at 80 percent
of his ability today would do so tomorrow and every day, it would be easy to
describe him. Typical behavior, however, is an abstraction. It is doubtful if
one ever has a truly typical day. Typical behavior could be described as an
average or composite of many single behaviors, as when one judges punctuality
by noting the number of tardinesses in a month. But no single event is
typical of the person’s punctuality, so that simple testing is difficult.
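
The idea of a composite of many single behaviors can be made concrete with a small sketch. The attendance record below is hypothetical, and the standard error shown assumes that the days are independent observations, which real behavior may not satisfy.

```python
import math

# Hypothetical record for 20 working days: 1 = tardy, 0 = on time.
days = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]

n_days = len(days)
tardy_rate = sum(days) / n_days
# Standard error of a proportion, assuming the days are independent observations.
std_error = math.sqrt(tardy_rate * (1.0 - tardy_rate) / n_days)

print(f"Tardy on {sum(days)} of {n_days} days (rate {tardy_rate:.2f}, SE {std_error:.2f})")
# Any single day is scored 0 or 1, so no one observation equals the composite rate.
print(all(day != tardy_rate for day in days))
```

The composite rate describes the month, yet no single day reproduces it. The size of the standard error relative to the rate also suggests how many observations a dependable estimate may require.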

Behavior is determined by the interaction of the person and his environment.
One can rarely be certain that behavior seen in one situation will also
occur in another. We find more consistent behavior if we define the trait to
be measured very narrowly, so that variation in situation would be slight; for
example, we might study “punctuality in reporting to the office during the
autumn.” The alternative is to think about the subject’s behavior in a
hypothetical “normal” situation, as we often do in describing cheerfulness,
neatness, or submissiveness.

To observe typical behavior, one must in some way obtain a sample of all
relevant situations and of all times when the situation arises. No one act can
be taken as typical, since it is influenced by mood, immediately preceding
experiences, details of surroundings, and other factors. Even if we should
happen to observe the person in an act that is typical for him, there is no
way to be sure of this typicalness except through further observations.

When we assume that we are measuring typical behavior, we often ignore 
the fact that behavior shows cycles and trends (6). Suppose that, in a series
of observations, we find the subject to be quarrelsome. It does seem that this
is typical for him, but perhaps he is in a continuing state of irritability due
to some worry, and some months earlier or later he would be found well-
adjusted socially. In this case, it could be contended that the sample was
insufficient to study typical behavior, and that the period observed was a
mere temporary deviation. Yet the deviation was real, and the behavior re- 
ported was typical for the subject during that time. 

It follows that careful attention must be paid to definition in thinking
about typical behaviors. One can only be sure what is meant by a “typical
behavior” (or trait) when the observer specifies the range of time represented
in the data, the range of situations represented, and the degree of
motivation to “good” performance present. While these are rarely considered
explicitly, it is often possible to find them implied in the nature of the tests
used. Fortunately, personality and work habits seem rather stable, which 
gives hope of useful observation. 

1. Define the range, in time and situations, of the “typical behavior” which you
think should be studied to answer these questions:

a. How well does each of these supervisors handle grievances?

b. Does a study of the “100 Greatest Books” make an adult more rational in his
daily life? 

c. Does viewing a film on nutrition improve housewives’ practices in menu 
planning? 

d. Do graduates of the modern elementary school write legibly? 

TYPES OF MEASURES 

Measures of typical behavior are subdivided into self-report procedures
and behavior observations. In each of these procedures, there are inherent
assumptions and problems.


Self-Report Tests 

Self-report devices have come into prominence because of the practical
difficulties in making sufficient observations to determine typical behavior.
While the experimenter can rarely follow his subject about in his many
activities, there is one person in a position to see how the subject acts, namely,
the subject himself. If we could enlist him as an observer, we would overcome
the limitations of cost and privacy.

Self-report devices usually ask the subject to look back upon his past and
from his memories to answer whatever questions concern the investigator.
Rarely, an inquiry employs a diary technique, in which events are recorded
as they occur. Another variant of the self-report requires the subject to state
only how he feels or acts at a given instant. This involves no error of memory,
but increases problems of sampling, since a particular instant may give a poor
sample of behavior over a period. Among the uses of questions of this sort
are the public-opinion poll (“How would you vote if the election were held
today?”), sociometric investigations (“Choose the three people you would
like to have on a committee with you”), and verbal substitutes for behavior
observations (“What did you have for lunch today?”). Questions of preference
(“Would you prefer to write a best-seller or own a successful night
club?”) are self-report devices asking for a prediction of behavior in
hypothetical situations no one can observe.

Most self-report tests rely on a person’s insight into himself. Personality,
adjustment, attitudes, and interests have been studied by such questions as
“Do you usually daydream?” and “Do you like mysteries?” The crucial
assumptions involved are that the subject is willing to tell the truth to himself
and to the investigator, and that the subject can determine the truth. There
is little direct evidence that subjects can be relied upon to tell the whole
truth about their personalities. On the other hand, there is considerable
reason to suspect that they sometimes conceal evidence they consider
damaging, and psychoanalytic studies call attention to repression, whereby we
conceal from ourselves memories which would be unwelcome. An adult, 
asked to give information regarding his own character, might be expected to 
suppress or even fail to think of events he has been ashamed of. A boy taken 
with the glamour of engineering as a profession may report great interest in 
engineering and engineering activities, even though he really lacks interest 
in all save the drama and prestige of engineering. 

Subjects are able to falsify self-reports. If a subject tries to make his score 
look “better” than the truth, he can do so. Students took the Humm-Wadsworth
Temperament Scale with instructions to be truthful, and again with
instructions to assume that the record would be used in considering them for
employment. Scores on the test changed markedly in the seemingly desirable
direction with the second directions, and correlations between the two testings
ranged from -.03 (“Normal Component” score) to +.61 (Paranoid
score) (13, p. 119). Olson found marked changes of score between anonymous
and signed responses to the Woodworth-Mathews Personal Inventory
(10). Fosberg gave the Bernreuter Personality Inventory with three sets of
directions: (a) standard, (b) to give the best possible picture of oneself, and
(c) to give the worst possible. The correlations of scores were standard with
“best,” .11; standard with “worst,” -.30; “best” with “worst,” -.83 (3). It is
therefore agreed that self-report tests are undependable unless full
cooperation is expected, or unless controls are used to detect or prevent faking.
Typical control methods are discussed in connection with the Minnesota
Multiphasic Personality Inventory (p. 319) and the Jurgensen Classification
Inventory (p. 325).

The self-report test suffers from problems of communication. Questions are
not likely to prove reliable or valid if they mean different things to different
subjects. Yet the words used, the responses required, and the very nature of
the task are left open to interpretation in most current self-report tests. The
subject is told, in one set of words or another, to report his typical behavior.
He is, however, sophisticated enough to know that he has no typical
behavior, but only a range of behaviors that come out in different situations and
moods. “Do you usually seek suggestions from others?” is a fairly clear
question; but most people would have to answer, “Sometimes I do, but not
always.” This might be further qualified: “I do on difficult problems”; “I do if
someone is around whose ideas are especially good”; “I don’t if I’m supposed
to make the decision myself.” These could be called quibbles, but they are
questions the subject must face in answering. Since he cannot average his
memories to determine what percentage of the time he has sought
suggestions, the question is likely to be answered offhand. If one person defines
“usually” to mean “with very few exceptions” and another defines it as “fairly
often; at least in difficult situations,” they are answering different questions
and their responses are not comparable. Another example of ambiguous
content in an apparently clear item is the Kuder Preference Record question:
“Do you like to operate an adding machine better than [two other
activities]?” Many students say that they enjoy this activity, but would be
dissatisfied with a job if they had nothing to do but operate an adding machine. It is
impossible to qualify items to eliminate such problems of interpretation. It
appears that in the presence of ambiguities, differences in final score are
strongly influenced by such response sets as a habit of saying “yes” when in
doubt, or of using the “?” or “indifferent” response for unclear items (2).
Most self-report tests employ such words as “always,” “frequently,” “usually,”
and “often.” That these words are ambiguous was shown by Simpson, who
asked subjects to indicate the frequency, in percent of all occurrences, that
would be required to justify the use of each word. Table 59 shows the
meaning of each word, as defined by his subjects (12).

Table 59. Range of Meanings Assigned to Words
Commonly Used in Personality Tests (12)

                Answers Given When Asked to Indicate What Percent
                of All Occasions Is Indicated by the Word

                                    Range of Answers of
Word            Median Answer       Middle 50 Percent of Subjects

Usually              85                  70-90
Often                78                  65-85
Frequently           73
Sometimes            20                  13-35
Occasionally         20                  10-33
Seldom               10                   6-18
Rarely                5                   3-10

To some extent, investigators have been able to avoid the criticism that
self-reports are inaccurate by justifying their techniques empirically. One
who wishes to consider himself healthy may overrate his health in answering
“Is your health better or poorer than the average for your age?” Another
person with only minor ills may exaggerate them, perhaps without conscious
intention, to get sympathy or justify self-pity. This question would not obtain
valid facts about health. But if it can be established that clinically diagnosed
neurotics reply “poorer” more frequently than do normals, this answer may
be diagnostic even when it is “untrue”; in fact, it may be diagnostic just
because it is untrue.

Empirical uses of self-reports are necessarily valid. The report itself is a
behavior; one obtains a direct record of response to a standardized stimulus
when he asks a verbal question. Empirical scales, according to Meehl, take
the “attitude that the verbal type of personality inventory is not most
fruitfully seen as a ‘self-rating’ or self description whose value requires the
assumption of accuracy on the part of the testee in his observations of self.
Rather is the response to a test item taken as an intrinsically interesting
segment of verbal behavior, knowledge regarding which may be of more value
than any knowledge of the ‘factual’ material about which the item
superficially purports to inquire. Thus if a hypochondriac says that he has ‘many
headaches’ the fact of interest is that he says this” (7, p. 9).

The usefulness of empirical tests depends primarily on the adequacy of
the experimentation by which the scale has been validated. Manifest
absurdities in empirical treatment of self-reports suggest that the original
validation was not based on an adequate sample. Allport, for example, protests
against a scale in which the word association “green” to the stimulus “grass”
is scored +6 as a sign of “loyalty to the gang” (1, p. 329). Loyal boys might
have given this response often in the sample test, but it is implausible that
this would be found in further studies. Large and representative samples are 
crucial in establishing empirical keys. 

In view of the difficulties encountered, self-report tests, especially those
not based on empirical validation, have been severely criticized. Much work
has been done to find acceptable substitutes, and it is probably fair to say
that self-report methods should never be used where acceptable substitutes
are available. Because self-report tests can be short and inexpensive and
because they frequently have crude validity for the mass of subjects even when
invalid in individual cases, they are unlikely to be completely replaced.

2. Distinguish in each of these cases whether the investigator is assuming that
self-reports are truthful.

a. The clinical symptoms of condition x are determined by observation. A list of
symptoms (swollen feet, rash, etc.) is prepared. This list is used in many
localities to determine how frequent condition x is. Each subject is asked to
check whatever symptoms he has.

b. There are three general stages in social development in which a child names
as favorites (a) other children without regard to sex, (b) persons of his own
sex exclusively, (c) persons of the opposite sex. The investigators ask a child
to name his favorite playmates as a means of determining his level of
development.

c. A psychologist administers a check list to a group of applicants, in which
each checks the adjectives he considers as describing himself. The success of
these men is observed, and a list is made of the characteristics checked by
the successful applicants but not by the others. This check list is then given
to further applicants, and those who check the same characteristics as the
previously successful men are hired.

3. List several situations in which signed self-reports are likely to be truthful. 

Observations 

Among the techniques in which actual behavior is observed in order to
deduce typical behavior, three types may be noted: observations in natural
situations, observations in standardized situations, and projective methods.
These categories overlap, but a distinction calls attention to special purposes
and problems. Observation in the natural situation calls for studying actions
under everyday conditions, covering a narrow or wide range of the subject's
life. One may study work habits by observing the worker at his bench day
after day, or one may study flying performance by observing the pilot as he
handles his plane. A broader range of situations would be involved in
studying the social behavior of a child, which could be done by having each
teacher report observations made in classes and on the playground.
Standardized observations are made in situations so planned that every subject
undergoes approximately the same experiences. If every person works under
the same conditions, what each does will be primarily a reflection of his
individual characteristics. In projective tests the subject is turned loose in an
unfamiliar situation where he presumably has no specific habits for
behaving; what he does in this “unstructured” situation tells a great deal about his
deep-lying behavior tendencies.

A crucial problem in observing behavior is to make sure that the act of
observing does not distort the behavior. Just as the presence of a traffic cop
at an intersection raises drivers from their habitual level to nearly their best
ability, so the presence of the judge or observer may cause the subject to try
harder. This effect seems to occur even when no reward or punishment is 
involved. Roethlisberger and Dickson attempted to study typical work out- 
put under a variety of conditions at a Western Electric manufacturing plant 
(11). Relay assemblers were placed in a small experimental room where 
they could be observed and their output recorded in great detail. In suc- 
cessive periods various rest pauses were introduced, to test the effect on 
production. As each change was introduced, no matter what it was, produc- 
tion for most workers climbed. Finally, in the twelfth and thirteenth periods, 
the rest pauses and privileges were removed, and production per hour still 
was as high as under the “best” working conditions. Another striking change 
was that absenteeism changed from 15.2 days per year per worker before 
entering the study to 3.5 days per year after workers were placed in the test 
room. The investigators were forced to abandon their study of rest pauses as 
a single variable, since the morale of the workers (as a result of being more
closely singled out for study, of being better acquainted with their
supervisors, and of feeling personal responsibility for their rate of output)
basically altered their performance.

Distortion due to judging is reduced when the judging is a regular part of
the work procedure. Ratings by foremen are probably based on typical
behavior in the plant, for the foreman is usually present; how the man would
act if the foreman were not present is not in question. Wherever the rater
can be a normal member of the group, his presence will have little distorting
effect on results. These distortions may be minimized by concealing the
observer, as by the one-way vision screen employed for studying infants (4), by
giving the observer an ostensible reason for being present, as in studies of
adolescents where adult observers participated in their picnics (9), or by
establishing familiarity through a long series of preliminary observations.

The standard observation has definite advantages over observations made
in normal random conditions. Fair comparisons between men are impossible
if the normal situations under which they work are unlike. Watching sales- 
man A deal with prospect X, and salesman B with prospect Y, is likely to be 
misleading if the prospects differ in temperament and approachability. 

Standardized observations reveal behavior typical of non-test situations
only when the subject is unaware that he is being observed. This is hard to
manage, but one can often assume that the subject does not know what
characteristics are being observed. This is illustrated in a stress investigation.
Aircrewmen were asked to observe a pattern of five planes in a short exposure
on a motion-picture screen, and to record the locations of the planes. Suc- 
cessive items were shown with shorter exposures, until it was impossible to 
observe and record in the time allowed. A few easy items, with ample time, 
were spread through the test. Decline in score on these items was taken as a 
measure of disturbance due to frustration (5). Since the subjects thought 
they were being tested on perceptual ability, it is unlikely that they recog- 
nized that the frustration was planned to observe their resistance to it. This 
test was not used practically, so that evidence of empirical validity is lacking. 

4. If a subject suspected that the recognition test was a frustration experiment, how 
would he probably behave? 

5. Is the observation of actual driving usually required of applicants for drivers’ 
licenses a test of typical behavior? 

6. Devices have been invented to record whatever changes are made in the setting 
of a radio or television dial. By installing a large number of these in cooperating
homes, broadcasters might obtain precise data on when programs gain or lose 
listeners. Do you think the typical tuning behavior of the set owner would 
change when he knows this device is added to his set? 

Projective tests attempt to avoid the systematized behaviors subjects use
to conceal their individuality. Children are easily observed in daily life,
because they are relatively naive and have not yet learned that it is desirable to
draw a curtain over their personalities. As maturity increases, one represses
more and more of his thoughts and controls his actions. By adulthood, one
has a series of etiquettes, customs, and clichés in which he can hide himself.
Applicants for secretarial positions display an astonishing similarity of words,
smiles, and actions during an interview. Insofar as they can guess what is
expected of an ideal applicant, they act that role. Yet the goal of the
interviewer is to determine not ability to assume a mask but underlying
personality. The problem of the tester is to penetrate the façade in a short time. The
ideal answer may be to present a situation so novel that the subject has no
idea what disguise to assume. This should reveal his “basic nature.”
Projective tests attempt to permit this type of observation. Because these tests
seek to uncover personality structure rather than specific isolated habits,
Murphy makes this extreme statement about one of them: “The Rorschach
should predict a specific future behavior better than does the study of past
specific behavior, except when the new behavior is essentially a repetition of
the old” (8, p. 675). Owing to the newness of projective techniques, it will
be some time before practical uses are fully developed.

CHAPTER 14


Self-Report Techniques: Personality 

HISTORICAL AND THEORETICAL BACKGROUND 

History of Personality Questionnaires 
Self-report techniques have a long history. Medicine has long used the
patient’s report of his symptoms for diagnosis and evaluation of treatment. In
psychology, almost the entire development of modern hypotheses about
personality hinges on self-report. Freud, for example, based his findings and
theories almost wholly on interviews, reports of dreams, and other
introspective data. Attempts to reduce personality to psychometric terms grew
out of the need for mass-processing methods capable of more speed and
standardization than clinical interviews permitted. As in the case of
intelligence testing, World War I brought into prominence the possibilities of
group procedures for obtaining self-reports.

Essentially, a personality questionnaire is a standardized interview. The
sensitivity of individual questioning is sacrificed for speed. The first
important questionnaire was Woodworth’s Personal Data Sheet, widely used
in processing World War I recruits. The Personal Data Sheet (45) was a
collection of questions about symptoms considered significant by psychiatrists
(daydreaming, enuresis, etc.). Woodworth’s scale was found useful for
detecting maladjusted soldiers, and was especially valued because adequate
individual interviews were totally impracticable.

Following World War I, enthusiasm for mental testing led to a demand for 
equally promising tests of personality. A swarm of instruments for probing 
the individual were devised, many of them adapting Woodworth’s questions.
Each test consisted of a collection of questions presumed to measure one
aspect of personality. Among the supposed traits often investigated were
neurotic tendency, introversion-extroversion, and dominance. Items were chosen
which, by definition, were related to each trait: “Do you worry . . . ?” would
show neurotic tendency; “Do you lead . . . ?” would show dominance. In
some cases the assumption that the items all related to a single trait was
checked by internal-consistency tests or an outside criterion. Woodworth, for
example, used only items which were checked twice as often by known
psychoneurotics as by normals. The Bernreuter Personality Inventory, published
in 1931, was intended to serve all functions of previous tests, being
constructed by pooling items from tests by Thurstone, Laird, Allport, and others.
Separate scores were obtained for the traits the previous authors had sought
to measure (neurotic tendency, introversion, and dominance), plus
self-sufficiency. Because it appeared to offer a considerable amount of
information about the individual, the Bernreuter test was widely accepted. Although
there have been recent innovations in the design of personality tests, these
newer ideas are only beginning to be widely used. Most current personality
testing in schools and industry is done with devices like Bernreuter’s.

The results of questionnaires were not always satisfying. Because they
were introduced at a time when intelligence tests were being used to “make
psychological measurement scientific,” personality tests were expected by
many to contribute a high level of accuracy to studies of personality. The
fact that these devices could yield no more than an interview and would 
usually yield less only gradually became apparent. With this realization, un- 
critical use of questionnaires began to decrease, and more thoughtful bases 
for their construction were generally adopted. 

One major line of attack was improved definition of traits to be tested.
Research since 1931 has been largely directed at replacing the traits arbitrarily
assumed by early investigators with traits representing genuine and
significant aspects of personality. One approach to this is factor analysis. Flanagan’s
factor analysis of the Bernreuter test (see p. 316) broke the ground in this
area. Recent studies have provided lengthy lists of fairly pure traits which
may be tested.

Some of the most promising recent instruments have placed their
emphasis on scores claiming empirical validity, making psychological
interpretation of scores a secondary consideration. This is the case with the Minnesota
Multiphasic Personality Inventory (MMPI) and the Humm-Wadsworth
Temperament Scale. Items were selected, not because they fitted a definition
of a trait, but because experimental trial showed that mental patients gave
responses different from those of normal adults. Scores on these tests indicate
how far the subject deviates from normal in such directions as “paranoid,”
“hysteroid,” or “manic.” The manic key is simply the group of items that
persons classified as manic answer differently from normals. This is the same
procedure that Woodworth originally used to choose his items, but which
was largely neglected in subsequent questionnaires.
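
The logic of empirical item selection described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure actually used for the MMPI or the Humm-Wadsworth scale: answers are coded True for "yes" and False for "no", an item is keyed when the proportion of "yes" answers in the criterion group differs from that in the normal group by at least an arbitrary threshold (`min_difference`), and the response data are invented. All function names are hypothetical.

```python
def endorsement_rate(group, item):
    """Proportion of subjects in a group answering 'yes' (True) to an item."""
    return sum(1 for answers in group if answers[item]) / len(group)


def build_empirical_key(criterion_group, normal_group, min_difference=0.25):
    """Keep items the two groups answer differently.

    Returns {item index: keyed answer}, where the keyed answer is the one
    given more often by the criterion group than by the normal group.
    """
    key = {}
    for item in range(len(criterion_group[0])):
        difference = (endorsement_rate(criterion_group, item)
                      - endorsement_rate(normal_group, item))
        if abs(difference) >= min_difference:
            key[item] = difference > 0
    return key


def score(answers, key):
    """Count the keyed items a subject answers in the criterion direction."""
    return sum(1 for item, keyed in key.items() if answers[item] == keyed)


# Invented answers to four items; each row is one subject.
CRITERION = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, False, False],
    [True, True, True, True],
]
NORMAL = [
    [False, True, True, True],
    [False, False, True, True],
    [True, True, True, False],
    [False, True, True, True],
]

KEY = build_empirical_key(CRITERION, NORMAL)

if __name__ == "__main__":
    print("Keyed items:", KEY)
    print("Score of a new subject:", score([True, False, False, True], KEY))
```

Nothing in the sketch inspects what an item says; an item enters the key only because the two groups answer it differently, which is why the adequacy of the validation sample matters so much for keys of this kind.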

Although faith in personality questionnaires, except as “preliminary
interviews,” had somewhat declined by 1940, interest has been revived by their
excellent service during World War II. Industries, anxious to expand their
working forces rapidly without hiring misfits, made much use of the tests and 
reported satisfactory results, although scientific evidence of validity was 
scant. In the armed services, rapid screening devices were necessary, and for 
this purpose two new questionnaires were widely used. The Cornell Selectee 
Index and the Shipley Personal Inventory were found to be rapid and ef- 
ficient tools in military processing. 

In virtually every walk of modern life, personality and mental hygiene are 
crucial. Maladjusted parents make for maladjusted children. Teachers with 
inadequate personal adjustment are unsatisfactory. Employees who grumble
or are temperamental cause undue trouble. In most chronic illnesses, doctors 
now recognize a psychological element. In these and all the other areas 
where doctors, psychologists, and social workers would like to identify, diag- 
nose, and assist the emotionally immature person, a rapid group test could 
be of untold value. As present techniques become more valid, they will aid in
solving major social problems. 

1. Which of the following investigators assumed that self-reports are truthful:
Woodworth, Bernreuter, Hathaway and McKinley (MMPI)?

The Trait Theory of Personality 

Nearly all the prominent questionnaires have sought to describe the
individual in terms of traits. If a test is to be a measuring stick, assigning a rank
or score to the individual, there must be a characteristic or dimension in
which this variation takes place. The desire for linear measures analogous to
those for size, temperature, and reaction time led psychologists to postulate
that personality had dimensions or traits. A trait is a tendency to react in a
defined way in response to a defined class of stimuli. Traits are familiar in
everyday thinking; nearly all the adjectives which apply to people are
descriptions of traits: happy, grouchy, conventional, stubborn, and so on. Traits
are elusive in scientific analysis, however, and can be defined and measured
only at the risk of some ambiguity.

The postulate that traits exist is based on three facts: (1) Personalities
possess considerable consistency; a person shows the same habitual reactions
over a wide range of similar situations. (2) For any habit we can find among
people a variation of degrees or amounts of this behavior. (3) Personalities
have some stability, since the person possessing a certain degree of a trait this
year usually shows a similar degree next year.

These facts lead one to consider personality traits as habits, capable of
being evoked by a wide range of situations. It would be tedious to catalog a
series of traits such as “habit of bowing politely when meeting a pretty
woman of one’s own age on the street on Sunday,” “habit of bowing politely
when meeting a not-pretty woman . . . ,” etc. Therefore traits are sought
which describe consistent behavior in a wide range of situations. Any value
of the trait approach to personality depends on the hope that it will describe
economically the significant variations of behavior, neglecting unduly
specific habits. Since the English dictionary offers no less than 17,953 adjectives
describing traits, the problem of economy is a serious one (3).

A trait is a composite based on many specific behaviors covered by the
trait name. To say that a boy is perfectly honest gives information which
definitely predicts his behavior in any situation involving honesty. In most cases,
however, the person possesses an intermediate degree of the trait in question;
he is honest in some situations but not others. Two people possessing the
same composite score in a trait may not be alike in personality. We might say
that a boy is “50 percent honest,” implying that there is an even probability of
honest or dishonest behavior. But the description would conceal the fact that
he is perfectly honest in money matters and perfectly dishonest in situations
in which grades are the reward; the description in terms of his hypothetical
trait “general honesty” is an imperfect basis for prediction. The extent to
which a trait gives good descriptions depends upon the extent to which the
behaviors collected under the trait definition are coherent and present in the
same person. Traits have therefore been defined most often by
internal-consistency tests. A group of items or test situations is collected which may
relate to one aspect of behavior, and the degree of relation is tested empirically
by the method presented on p. 78. Items which do not correlate with the 
composite are dropped. The trait is, in the end, defined by the items retained 
in the test. 
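
An internal-consistency check of the kind just described can be sketched as follows. This is an illustration under assumptions and not necessarily the method presented on p. 78: each item is correlated with the total of the remaining items (so that an item is not correlated with itself), and items falling below an arbitrary cutoff are dropped. The response data, the cutoff `min_r`, and all function names are hypothetical.

```python
import math


def pearson(x, y):
    """Product-moment correlation of two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant item cannot correlate with anything
    return covariation / math.sqrt(spread_x * spread_y)


def item_rest_correlations(responses):
    """Correlate each item with the composite of all the other items."""
    totals = [sum(row) for row in responses]
    correlations = []
    for item in range(len(responses[0])):
        item_scores = [row[item] for row in responses]
        rest = [total - s for total, s in zip(totals, item_scores)]
        correlations.append(pearson(item_scores, rest))
    return correlations


def retain_items(responses, min_r=0.20):
    """Indices of items whose correlation with the composite reaches min_r."""
    return [item for item, r in enumerate(item_rest_correlations(responses))
            if r >= min_r]


# Invented answers (1 = keyed response) of six subjects to four items.
RESPONSES = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
]

if __name__ == "__main__":
    for item, r in enumerate(item_rest_correlations(RESPONSES)):
        print(f"item {item}: r with rest of test = {r:.2f}")
    print("retained items:", retain_items(RESPONSES))
```

The sketch makes a single pass; in practice the composite changes once items are dropped, so the correlations would be recomputed on the shortened test. The items that survive are what finally define the trait.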

2. Show how the trait “stubbornness” might be present in some situations and
absent in others for the same person, even though both actions are typical 
for him. 

3. An investigator believes that leaders can be characterized by the trait
“autocratic-democratic.” Outline the procedure needed to make a self-report test, if
he decided to choose his items 

a. By definition. 

b. By internal-consistency tests. 

c. By correlation with an external criterion. 

Traits having different names may not be genuinely different.
“Introversion” was postulated as a trait after clinical studies showed that certain
behaviors such as isolativeness, interest in ideas, and shyness were often present
together. “Neurotic tendency” attracted interest as a result of studies which
showed the neurotic pattern to contain certain recurrent symptoms. Both
were widely used to describe personality, until the Bernreuter scale showed
correlations between the two of .96 (this containing a spurious element
because both scores were based on the same questions). Flanagan applied
factor analysis to the four scores from Bernreuter’s test, using a group of 305
adolescent boys. The correlations shown in Table 60 indicate that the scores
overlap greatly and are certainly not reporting four independent traits.
Flanagan found that two factors were sufficient to account for the major
differences between individuals reported in these four scores. He therefore
proposed two new scores, to represent the two traits the Bernreuter test could
distinguish. His scores, called Self-Confidence and Sociability, are nearly
independent (r = .04). One may have high confidence with or without
sociability. The old “neurotic tendency” is about the same as low confidence;
“self-sufficiency” is merely a combination of high confidence and low sociability;
and so on (14).

Table 60. Intercorrelations of Bernreuter Scores for Adolescent
Boys, Corrected to Remove Spurious Elements (14, p. 44)

                     Neurotic     Self-
                     Tendency     Sufficiency    Introversion    Dominance

Neurotic Tendency                    -.39            .87            -.69
Self-Sufficiency                                    -.33             .51
Introversion                                                        -.62
Dominance
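
The claim that two factors account for most of what the four Bernreuter scores report can be checked roughly from Table 60. The sketch below is not Flanagan's analysis; it is a simple principal-axis calculation, assuming unities in the diagonal, that finds the two largest latent roots (eigenvalues) of the correlation matrix by power iteration and reports what share of the total variance they carry. The function names and the choice of method are assumptions made for illustration.

```python
import math

LABELS = ["Neurotic Tendency", "Self-Sufficiency", "Introversion", "Dominance"]

# Correlations from Table 60, with 1.00 assumed in the diagonal.
R = [
    [1.00, -0.39, 0.87, -0.69],
    [-0.39, 1.00, -0.33, 0.51],
    [0.87, -0.33, 1.00, -0.62],
    [-0.69, 0.51, -0.62, 1.00],
]


def mat_vec(matrix, vector):
    """Multiply a square matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def normalize(vector):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def dominant_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and its eigenvector, found by power iteration."""
    vector = normalize([1.0 / (i + 1) for i in range(len(matrix))])
    for _ in range(iterations):
        vector = normalize(mat_vec(matrix, vector))
    value = sum(a * b for a, b in zip(vector, mat_vec(matrix, vector)))
    return value, vector


def deflate(matrix, value, vector):
    """Remove the part of the matrix accounted for by one eigenpair."""
    n = len(matrix)
    return [[matrix[i][j] - value * vector[i] * vector[j] for j in range(n)]
            for i in range(n)]


def top_eigenvalues(matrix, count):
    """The `count` largest eigenvalues, extracted one at a time."""
    values = []
    current = [row[:] for row in matrix]
    for _ in range(count):
        value, vector = dominant_eigenpair(current)
        values.append(value)
        current = deflate(current, value, vector)
    return values


FIRST, SECOND = top_eigenvalues(R, 2)
SHARE = (FIRST + SECOND) / len(R)  # total variance equals the number of scores

if __name__ == "__main__":
    print(f"first root {FIRST:.2f}, second root {SECOND:.2f}")
    print(f"share of total variance in two factors: {SHARE:.0%}")
```

If the four scores measured four independent traits, each root would be near 1.00 and two factors would carry only half the variance; a much larger share in the first two roots is what the overlap in Table 60 implies.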

A more basic approach to trait identification considers the intercorrelations
between single items. Items having substantial correlations are brought
together to form broader traits. Cattell, from factorial studies, considers that 12
such traits have been identified (10). Guilford has published several
questionnaires measuring such presumably homogeneous traits. His Inventory of
Factors S-T-D-C-R, for example, breaks the familiar introversion-extroversion
into social introversion, thinking introversion, depression, cycloid tendencies,
and rhathymia (happy-go-lucky disposition). Some of these scores are
correlated, so that the traits are not independent (18). The Guilford list has
been simplified by Thurstone into a set of seven “primary” traits which most
often affect scores in current self-report tests. These seven aspects of
temperament are: reflective (thinking introversion), friendly, emotionally stable,
masculine, ascendant (leadership), active, and impulsive (42).

“The normal personality” presents troublesome problems of measurement. 
It is comparatively easy to identify abnormals, since they tend to fall into
distinct syndromes or patterns of personality. In most traits, the deviate at 
either end of the normal distribution is well characterized by his score. He ex- 
hibits the trait in unusual degree and in a large number of situations. Inter- 
mediate positions, falling toward the center of the distribution, tell the in- 
vestigator little except that the subject is not a deviate. Every normal person- 
ality has its unique characteristics. Even a person who is “normal” in all the 
traits we measure has individuality. Reducing his performance to a single in- 
dex of normality, we lose patterns which make him different from his also- 
normal neighbors. 

Allport has criticized the entire trait approach on this ground (2, pp.
248-257). Suppose that honesty were statistically coherent, so that the
correlations between different types of honesty were substantial. This might be
explained in terms of differences in moral upbringing which make some people
well-behaved throughout their lives. But there are other organizing patterns
underlying honesties which make their correlations less than 1.00. One man
may act from need; he will take money to feed his family, but will not cheat
or lie. Another man may be prudent rather than honest; he will be as honest
as he must to avoid being caught. Another may define honesty in a limited
way; he would never steal, but he thinks it right to operate a business on the
principle of “buyer beware.” These men are all honest to an intermediate
degree; they are all more or less “normal.” The problem in studying personality
is to detect the true patterns behind behavior. Although the trait method will
not do this, there is probably no solution but the trait approach for studying
personalities in the mass.

4. Some testers treat Flanagan's two scoring keys as supplements to the four
plied by Bernreuter, and report all six scores to describe an individual. Discuss 
the advisability of this practice. 

5. Do you think that most introverts are emotionally maladjusted? What does this 
answer imply regarding the correlation of the N and I scales? 

6. What is a “desirable” score on a test such as the Bernreuter? 

7. What does it mean to say that Henry falls at the 50th percentile in Sociability?

8. “Adjustment" inventories ask the person to indicate his worries or symptoms. 
The number of such symptoms is taken as his score. What meaning should be 
attributed to a median score on such a test? 

SIGNIFICANT SELF-REPORT TESTS 
To understand how psychologists have designed self-report tests, and to in- 
terpret the evidence on validity, we shall need to know a few tests in detail. 
The following section describes the Minnesota Multiphasic, which represents
unusually careful test design and seems useful for studying abnormal
subjects; and the Bell and Bernreuter inventories, which have been widely used
and for which much evidence on validity has accumulated. Other tests to be
described are important, not because they are widely used, but because they
have unique and promising features. The Allport-Vernon Study of Values,
Jurgensen’s Classification Inventory, the Shipley and Cornell inventories for
military use, the Rogers test for children, and the PEA Interest Index fall in
this group. 


Bell Adjustment Inventory 

The Bell Adjustment Inventory is simple in design and intended for use
with normal groups rather than for clinical analysis. The inventory has two
forms, one for students in high school and college, the other for adults. The
student scale yields separate scores for home, health, social, and emotional
adjustment. Items, answered “yes,” “no,” or “?,” include

Do you get discouraged easily? 

Are you subject to eye strain? 

Would you feel very self-conscious if you had to volunteer an idea to start a dis- 
cussion among a group of people? 

Questions retained in the scale are those found to differentiate between
students known to be maladjusted and students considered normal by judges
who know them well. The principal use of the test is to identify those who
should be offered counseling. While “problem cases” who cause trouble for
teachers are easily recognized, students who are withdrawn, shy, and
internally upset may not attract the attention of observers. An adjustment
inventory brings to light many of these cases.

Bernreuter Personality Inventory

The Bernreuter test has been referred to at several other places in this 
chapter. It was first designed for use with college students and has been
extended to pupils in schools, employees, and other groups. It provides a
description of personality which, if valid, would be of interest in case studies
and research. The Neurotic Tendency score is a convenient index of
self-reported maladjustment.

The test consists of questions much like those of other inventories: e.g.,
“Do you daydream frequently?” The test seems primarily to measure
self-confidence and sociability. Scoring is relatively complex, particular answers
being given weighted credits. For the item on daydreaming, “yes” is credited
3 points, “no” is credited -5, and “?” 0, in the Self-Confidence score. Other
credits are assigned to the same answers in each of the other keys.
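
The weighted scoring just described can be sketched in a few lines of Python.
This is an illustration only, not the published scoring procedure: the three
Self-Confidence credits for the daydreaming item are those quoted above, while
the item label, the function names, and the credits shown for a second key are
hypothetical placeholders.

```python
# Weighted-key scoring: every answer to every item carries a credit in each key.
# Only the Self-Confidence credits for the daydreaming item come from the text;
# the "Sociability" credits are invented placeholders for illustration.
KEYS = {
    "Self-Confidence": {"daydream": {"yes": 3, "no": -5, "?": 0}},
    "Sociability": {"daydream": {"yes": 1, "no": -1, "?": 0}},  # hypothetical
}


def score(answers, key):
    """Sum the credits that one key assigns to a subject's answers."""
    return sum(key[item][answer] for item, answer in answers.items() if item in key)


def profile(answers):
    """Score the same answer sheet once under every key."""
    return {name: score(answers, key) for name, key in KEYS.items()}
```

The point of the sketch is that one answer sheet is read several times, once
per key, so a single response can raise one score while lowering another.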

9. Since “neurotic tendency” and “maladjustment” are measured by similar items,
the Bell and Bernreuter inventories may be used to identify the same types of 
students. What practical considerations would be important in choosing between 
them, if they are equally reliable, and validity in the situation under considera- 
tion has not been determined? 

Minnesota Multiphasic Personality Inventory

The self-report test most promising for diagnosing abnormal personality
patterns in the clinic is the MMPI. This test consists of 495 questions to be
sorted “true,” “false,” or “cannot say.” Typical items are

I think most people would lie to get ahead. 

A windstorm terrifies me.

My hardest battles are with myself. 

These items were administered to diagnosed psychiatric patients and to
normal adults, principally visitors to a large city hospital. Responses were
compared, and any answer given more often by one of the psychiatric groups
than by normals was placed in the appropriate scoring key. The classifications
scored are hypochondriasis (Hs), depression (D), hysteria (Hy),
psychopathic deviate (Pd), masculinity (Mf), paranoia (Pa), psychasthenia (Pt),
schizophrenia (Sc), and hypomania (Ma). The higher a subject’s score in a
classification, the more his answers resemble those given by that type of
patient. Any standard score greater than 70 is taken as an indicator of
significant abnormality.
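
The empirical keying described above can be sketched as follows. The sketch
rests on assumptions not stated in the text: the function names, the data
layout, and the `min_difference` margin are hypothetical (the text says only
that an answer given “more often” by a patient group was keyed), and the
standard score is taken to be a T-score with mean 50 and standard deviation 10
in the normal group.

```python
from statistics import mean, pstdev


def endorsement(group, item, answer):
    """Proportion of a group giving a particular answer to an item."""
    return sum(sheet[item] == answer for sheet in group) / len(group)


def build_key(patients, normals, min_difference=0.0):
    """Key the answer to each item that patients give more often than normals.

    Each group is a list of answer sheets; a sheet is a dict {item: answer}.
    min_difference is a hypothetical margin, not a figure from the text.
    """
    key = {}
    for item in patients[0]:
        best_answer, best_gap = None, min_difference
        for answer in ("true", "false"):
            gap = endorsement(patients, item, answer) - endorsement(normals, item, answer)
            if gap > best_gap:
                best_answer, best_gap = answer, gap
        if best_answer is not None:
            key[item] = best_answer
    return key


def raw_score(sheet, key):
    """Count the subject's answers that agree with the keyed answers."""
    return sum(sheet.get(item) == answer for item, answer in key.items())


def t_score(raw, normal_raws):
    """Express a raw score on a scale where normals have mean 50, S.D. 10."""
    return 50 + 10 * (raw - mean(normal_raws)) / pstdev(normal_raws)
```

Nothing in such a key depends on what an item appears to mean; an answer earns
its place only by separating the patient group from the normals.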

Some errors common to self-report tests are controlled by a set of scores 
known as ?, L, K, and F. The “?” score is based on the number of times the
person replies, “cannot say.” Excessive evasion of questions of course makes
it impossible to compare a count of the subject’s reports of his symptoms
with the replies of the psychiatric group. Profiles showing high “?” scores are
recognized as invalid. There are some test items so extremely worded that a
person who denies having these symptoms is almost certainly not evaluating
himself frankly. The L (lie) score is based on a count of these improbable
answers; a high L score indicates that answers are untrustworthy, but need
not indicate deliberate lying. The K score is based on the finding that some
normal people earn scores above 70 in significant columns. A list of items
checked by these “false positives” was studied and built into the K correction
key (29, 31). A person who earns a very low K score is likely to have been
unusually severe in describing himself while taking the test; apparently,
excessive frankness causes some normals to earn “abnormal” scores. One who
habitually gives himself the benefit of the doubt earns a notably high K
score. The F (false) score is a count of replies given by the subject which
are given by others extremely rarely. This count reveals careless sorting,
misunderstanding, or other invalid answers.

The operation of these keys is shown in a study in which psychologically 
trained persons attempted to “fake” in answering the test so as to appear 
psychoneurotic or psychotic. In attempting to give answers like neurotics, 
these experts succeeded (their mean Hs was 99; mean D, 108; mean Hy, 87; 
mean Pt, 86), but their mean K was 44 whereas real neurotics averaged 51,
and their mean F was 74 compared to 67 for true neurotics. In simulating
psychotic answers, they also succeeded fairly well (mean Pa, for example,
was 97), but mean F at 80 and mean L at 60 signaled that something about the
responses was untrustworthy (16). 

Although the test has been used successfully with clinical patients, it has
not been found trustworthy with college students. Many college students
earn scores which would usually be indicative of abnormality, although these
students are known to be adequately adjusted. This is a further example of
the undesirability of blindly applying a test validated on one population to a 
different type of group. 

The MMPI is quite similar to the earlier Humm-Wadsworth Temperament
Scale. This scale has been less widely investigated than the Minnesota
because commercial restrictions limit its availability. It has had wide
application in industry. The temperament scale reports scores based on a
psychiatric theory of the components of personality. It employs a check score to
identify undependable self-report. Both the MMPI and the Humm-Wadsworth
employ ingenious correction formulas to compensate for a subject’s
tendency to give a favorable or over-frank self-report (31, 23).

10. Under what circumstances might a person wish to convince a tester that he is 
psychotic, so that his scores would be falsely high? 

11. A client coming to a social agency shows these MMPI scores: 

Hs  D   Hy  Pd  Mf  Pa  Pt  Sc  Ma
43  45  50  50  50  68  42  67  69

How would the interpretation be affected if the “control scores” were as
follows:

a. ?, 72; L, 50; K, 50; F, 50.

b. ?, 50; L, 73; K, 50; F, 50.

c. ?, 50; L, 50; K, 72; F, 50.

d. ?, 50; L, 50; K, 35; F, 50.

12. How could one restandardize the MMPI so that it would correctly locate per- 
sons with mental hygiene problems among entering college freshmen?

The type of report of which the MMPI is capable is indicated in the fol- 
lowing analyses. The first is that of an 18-year-old boy, committed from a 
juvenile court after arrest for homosexual practices. The MMPI was admin- 
istered individually, the examiner reading the items aloud, since Earl’s men- 
tality was limited. The scores, plotted in Figure 74, are discussed as follows 
(9, pp. 494-495): 



Fig. 74. MMPI scores for Earl (9, p. 494). 


“The validity scores are conventionally acceptable, with 50 T-scores for ? and L,
and 60 for the F scale. The latter score shows a slight tendency toward
carelessness or lack of comprehension, but not so great as to invalidate the scale.
It can be seen that, with the exception of the masculinity-femininity scale (Mf),
all the scores lie toward the maladjustment end of the distribution; but
hypochondriasis, psychasthenia, schizophrenia, and hypomania are all beyond the
70-point customarily regarded as indicative of maladjustment. The hysteria score
is approximately at 70. This is a non-specific record and serves best to point up
areas of maladjustment rather than specific psychiatric entities. The high Hs, Hy,
and Pt indicate a neurotic trend; Sc, Ma (and also Pt) a schizoid trend. These
trends in part characterize the subject’s personality structure. Earl complained of
periodic headaches and dizziness and had been observed by other inmates to ‘black
out.’ These experiences . . . are probably reflected in the Hs and Hy scores. His
history of shyness and solitariness [had] already been noted. The hypomania score
(Ma) accounts for his minor periods of overproduction and temporary enthusiasms.
It is interesting to note in passing that despite Earl’s arrest and commitment for
homosexual activities, and certain feminine secondary sex characteristics, his
interests, as measured by the Mf scale, are not feminine.”

In this case, the MMPI did not provide a specific diagnosis, though it was
useful in establishing severe maladjustment and throwing doubt on the
importance of Earl’s sexual irregularities. The final diagnosis, based on
encephalogram and clinical analysis, was intracranial brain damage, probably
following birth injury, complicated by a completely unfavorable environment.

One can proceed beyond a general estimate of adjustment to a descrip- 
tion of the subject’s traits, if one is willing to assume that self-reports are true 
self-descriptions. Verniaud tested women optical workers and found that 
over 77 percent of her group exceed the norm in Hypomania, Psychasthenia, 
Paranoia, and Psychopathic Deviate. She interprets this as an expectation 
that these workers would be “restless, full of plans, alternating between
enthusiasm and over-productivity in energy output and moods of depression,
more inclined toward anxieties and compulsive behavior than the average
individual, disinclined to concentrate for long periods on one task, somewhat
oversensitive or suspicious of the good-will of others, somewhat more
inclined than the average woman to disregard social mores.” One woman
earned these scores:

?   L   F   Hs  D   Hy  Pd  Mf  Pa  Pt  Sc  Ma
50  53  50  41  42  52  67  63  67  60  58  68

Additional findings about the case were as follows (43, pp. 610-611):

“This woman is 34, a high school and business school graduate. She is married,
with a daughter of 12. She started out as a clerical worker, worked up to the
position of secretary to the editor of a local newspaper, advanced from this to
newspaper reporter. However, she was put to reporting women’s affairs, had no
chance to do general reporting as the men did, and after two years left to work for
a clothing manufacturer. She started out as his bookkeeper, but also went out into
the plant to do cutting. She disliked the bookkeeping, ‘loved’ the cutting. . . .

“She is now supervisor of the polishers. By the shift foreman’s report, she has
transformed the polishing room from one of their major headaches to a ‘bang-up’
job. She is exacting to work for, unpredictable as a person. She likes to do
‘screwball’ things. Once she emerged from the polishing room with the red polishing
compound smeared on hands and arms to the elbows, waved them in the
investigator’s face, saying ‘Blood!’ During the holidays, when you heard a jingle, you
knew that the supervisor of polishers was going through. She had tied bells to her
shoes.”

13. Under what conditions may a person have deviate scores on MMPI and yet not 
need clinical attention? 

14. What assumptions are made if a psychologist applies MMPI to adolescent boys 
coming before the juvenile court? 

15. Would it be legitimate to correlate scores on MMPI with job success, to see if 
it would be a helpful selection device? 

16. If, as some psychologists predict, a new set of diagnostic categories comes into
use for describing patients, what steps will be required to adapt MMPI to the
new system? Assume that the changes are fundamental, such as replacing
paranoia and schizophrenia with four new categories.





Study of Values 

A unique instrument which has been primarily a research tool is the
Allport-Vernon Study of Values. It is based on the conception of Spranger that
there are six basic types of men: theoretical, whose interest is truth (“His
chief aim in life is to order and systematize his knowledge”); economic,
whose interest is in the useful, the practical affairs of the business world;
social, who values love of people; political, primarily seeking power; religious,
who works for unity, “seeks to comprehend the cosmos as a whole”; and
aesthetic, who stresses form and harmony. The subject is asked to indicate
which of two activities he would prefer — for example, “Would you prefer to 
hear a series of popular lectures on (a) the progress and needs of social 
service work in the cities of your part of the country or (b) contemporary 
painters?” The former is scored for Social value, the second for Aesthetic. A 
profile of six scores is obtained. 
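
How paired preferences accumulate into a six-score profile can be sketched as
below. This is a simple one-point tally under stated assumptions, not the
published scoring scheme of the test: the item label and the function name are
hypothetical, and only the assignment of alternative (a) to Social and (b) to
Aesthetic for the lecture item is taken from the text.

```python
from collections import Counter

VALUES = ("Theoretical", "Economic", "Social", "Political", "Religious", "Aesthetic")

# Which value each alternative is scored for. Only the "lectures" item is from
# the text; the item label itself is a hypothetical name.
ITEM_KEY = {
    "lectures": {"a": "Social", "b": "Aesthetic"},
}


def value_profile(choices, item_key=ITEM_KEY):
    """Tally one point for the value attached to each preferred alternative."""
    tally = Counter({value: 0 for value in VALUES})
    for item, choice in choices.items():
        tally[item_key[item][choice]] += 1
    return dict(tally)
```

Because every point given to one value is a point withheld from another, a
tally of this kind shows the relative standing of the six values within the
person rather than his standing relative to other people.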

The test is at times valuable in academic and vocational guidance because
it is based on a set of variables which cut across the categories of more
common interest tests. Some “theoretical”-type persons, for example, find little
satisfaction in the day-by-day routines of some positions in industrial
chemistry, yet enjoy teaching chemistry.

17. A psychology major, who has a keen interest in psychology courses, wishes
guidance toward a particular vocation. 

a. What suggestions might be considered if he shows a high score in Theoreti- 
cal value? 

b. What vocation in psychology would be most appropriate to each of the 
other values? 

Rogers Test of Personal Adjustment 

In contrast to the trait approach taken by the bulk of tests, the Rogers test
for children and the PEA test for adolescents operate on a theory of
personality as a complex whole. In the usual test, a response is treated as if it had
the same emotional import for all subjects. Yet “I find it hard to make new
friends” may, in one person, be a sign of anxiety, and in another person may
be evidence of mature and calm insight into his limitations. By considering
self-descriptions in relation to the subject’s stated aspirations, Rogers obtains
a richer picture than any study of self-description alone can provide.
Necessarily, the more complex interpretation is difficult to objectify.

The test has the additional advantage of format and questions that appeal 
to children. Questions deal with the child’s fantasies, wishes, opinions of him- 
self, and affection for others. This is an example: 

Peter is a big, strong boy who can beat any of the other boys in a fight. Am I just 
like him? Do I wish to be just like him? 

The child answers by checking a scale from “yes” to “no.” There are separate
forms for boys and girls.

Although scores are obtained on the test, an analytic study of items is
required to obtain a clinical picture of the child. A diagnosis reported in the
manual reads: “On this test Edward shows evidence of a great deal of
maladjustment. His outstanding difficulty is his lack of adjustment to other boys
and girls. He evidently has no friends, is not well liked and lacks ability in
games. He also feels inadequate personally. He thinks he is lacking in brains,
strength, and looks. He covers his feelings of inferiority with an extravagant,
boasting attitude. He also indulges in a great deal of day-dreaming. He seems
most attached to his mother, and may be somewhat jealous of his father and
brother, although the evidence here is contradictory. He is to be regarded as
a personality problem of a very serious sort.”

18. What “traits” are discussed in the description of Edward? How does it happen
that so many facts are reported from a test of about 50 items, whereas other
longer tests report fewer traits?

Interest Index 

A similar clinical treatment has been developed for the Interest Index
8.2a, 8.2b, and 8.2c, developed for the Progressive Education Association.
This is included here rather than in the chapter on interest tests because its
primary use is for description of personality. The test, for high-school pupils,
has the advantage that students tend to be more frank in expressing interests
than in describing symptoms. Among the items to which they respond “like,”
“indifferent,” or “dislike” are the following:

To write stories. 

Having people take me for older than I am. 

Pretending to be the hero or heroine in a movie.

Helping to get meals ready at home. 

Dancing or singing or telling stories to amuse a group. 

These items are grouped into a large number of categories, ranging from
overt interests such as “Mathematics” to categories based on psychiatric
theory such as “Acceptance of own impulses,” “Solitary,” and “Fantasy life.”

The scores are used as a basis for interpretation of the personality as a unit.
This interpretation, which is a subjective synthesis of all the facts brought to
light by the test, considers adjustment less as a count of symptoms than as a
relation between desires and activities. The psychologist in one case gleans
sufficient material from the test, without having seen the boy himself, to fill
nearly six printed pages. The final summary is (38, p. 383): “In conclusion,
one may say that Lyle probably does not get into open clashes with adults
and is very likely to be academically a good student. His age-mates may elect
him to class offices, but probably few of them, if any, accept him as a real
member of the group. A number of youngsters are apt to be annoyed with
him and make him the butt of their jokes. Lyle’s main difficulties seem to be
that, although or because he has accepted prematurely the standards and
values of a certain group of adults, his own emotional development has been
warped and arrested.”

19. Consider the following tests: MMPI, Bell, Bernreuter, Allport-Vernon, Interest
Index. 

a. Which ones would give the most useful information to a high-school coun- 
selor? How could the information be used? Disregard ease of administra- 
tion. 

b. Answer the same questions from the point of view of the personnel manager
of a moderate-sized office.

c. Answer the same questions from the point of view of a clinical psychologist 
attached to a juvenile court. 

20. What “traits” are discussed in the portrait of Lyle based on the Interest Index?
Why would not a test scored for these traits serve the same function as the
index?

Classification Inventory 

The MMPI sorts out abnormal personalities by a direct empirical method,
rather than through descriptive traits. Jurgensen has approached employee
selection in a similar manner (26). His Classification Inventory is also
noteworthy in its use of the forced-choice technique. Jurgensen obtains a large
number of self-reports about irritations and preferences. One item, where the
subject indicates which alternative is most, and which least, irritating, is

People who (a) Laugh at you
           (b) Are sarcastic to you
           (c) Gossip about you.

There are no standard keys for the test. A new key is made in each
employment situation where the test is to be used. The test is given to employees
known to be good or poor on a particular job, and a key is made up of those
responses which differentiate good and poor workers. The test must be
validated anew in each new application.

21. What advantage, and what disadvantage, comes from making a new key for
each selection task, rather than establishing general keys for nation-wide use
for such occupations as salesman, accountant, etc.?

22. Assume that a descriptive inventory gives reliable scores on five traits. 

a. How could an industrial psychologist using such a test determine what 
score or score-pattern indicates probable success on a particular job? 

b. If this method, and Jurgensen’s, gave equally good validity coefficients, 
which type of test would be preferable in a personnel program? 

Other Tests 

Several devices were developed during the war for the identification of
possibly unstable recruits. The two most widely used were the Shipley
Personal Inventory (PI) and the Cornell Selectee Index. Each has been used in
a variety of editions. They are noteworthy because they were generally found
to be reliable despite their brevity, and because several studies have shown
satisfactory validity (20, 37).

The PI is in many ways like the Classification Inventory. It uses forced
choices, such as

(a) I wish I didn’t have so many aches and pains. 

(b) I wish I wouldn’t keep changing my mind. 

Choosing between undesirable alternatives is much less open to falsification
or ambiguity than the usual single question. Even a psychologically
sophisticated subject often cannot guess the “good” answer. The PI also reflects the
trend toward empirical keying of items. In the example above, one choice
is scored as indicating maladjustment, not because it is a well-known
symptom of disorder, but because that response is given more often by deviates
than by normals.

Rapport is a serious problem with forced-choice tests. Guilford is very
critical of the PI because some examinees in Air Force testing resented the
forced choice. “Some of the items appear unsuitable for use in the selection
program, because they present a choice between an acceptable and
unacceptable response (e.g., ‘I get embarrassed easily’ / ‘I seldom get
embarrassed’), or because both items are socially unacceptable and of the type of
‘Have you stopped beating your wife?’ (e.g., ‘Our family scraps often came
after someone had been drinking’ / ‘Drinking never was the cause of our
family scraps’)” (17, p. 607).

Facts about the more widely known tests of personality are summarized 
in Table 61. 


Table 61. Representative Personality Questionnaires


Allport Ascendance-Submission (A-S) (Houghton Mifflin)
  Age level [a]: Men; women
  Number of items: 34
  Variables scored; special features: Dominance
  Comments of reviewers [b]: “Very doubtful that scale can be employed without
  verification, supplementary information” (Humm, 7, p. 50).

Allport-Vernon Study of Values (Houghton Mifflin)
  Age level [a]: College and adult
  Number of items: 45
  Variables scored; special features: Six value types identifying evaluative
  attitudes
  Comments of reviewers [b]: “Validity fairly good. . . . Social value score
  unreliable, ambiguous. . . . With suitable caution, can be recommended
  as . . . having considerable value” (Meehl, 8).

Adams-Lepley Personal Audit (Psych. Corp.)
  Age level [a]: Adult
  Number of items: 450
  Variables scored; special features: Impulsiveness, irritability, tolerance, etc.
  Comments of reviewers [b]: “Claims made for this instrument go beyond the
  data” (Symonds, 8).

Bell Adjustment Inventory (Stanford)
  Age level [a]: Grades 9-16; adult
  Number of items: 140
  Variables scored; special features: Adjustment: home, social, job, etc. Items
  selected by external criterion.
  Comments of reviewers [b]: “Useful in a personnel program” (Darley, 7, p. 52).
  “Can be recommended to the clinician” (Beck, 7, p. 53). [c]

Bernreuter Personality Inventory (Stanford)
  Age level [a]: Adolescent and adult
  Number of items: 125
  Variables scored; special features: Self-confidence, sociability, etc.
  Comments of reviewers [b]: “No convincing evidence of validity” (Maller, 7,
  p. 189). [c]

California Test of Personality (Calif. Test Bureau)
  Age level [a]: Grades Kgn.-3; 4-8; 7-10; 9-coll.; adult
  Number of items: 96 to 180
  Variables scored; special features: Self-adjustment, social adjustment. Twelve
  subscores.
  Comments of reviewers [b]: “Subtest reliabilities far too low. . . . Validity
  entirely unestablished” (Shaffer, 8).

Classification Inventory (C. E. Jurgensen)
  Age level [a]: Adult
  Number of items: 288
  Variables scored; special features: Keys prepared to select good employees on
  particular jobs

Cowan Adolescent Adjustment Analyzer (Cowan Res. Proj., Salina, Kans.)
  Age level [a]: Age 12-18
  Number of items: 90
  Variables scored; special features: Areas of adj.: fear, family emotions,
  maturity, etc.
  Comments of reviewers [b]: “Despite weaknesses, one of the better tests for
  adolescents, used by a qualified child psychologist in the clinical
  situation” (Snyder, 8).

Humm-Wadsworth Temperament Scale (D. G. Humm)
  Age level [a]: Adult
  Number of items: 318
  Variables scored; special features: Hysteroid, manic, epileptoid, etc. Wide use
  in industry.
  Comments of reviewers [b]: “Room for more studies on reliability,
  validity. . . . Found it very useful in industrial situations, particularly
  on supervisory level” (Meltzer, 8). [c]

Interest Index 8.2a, b, c (Prog. Educ. Assn.)
  Age level [a]: High school
  Number of items: 600
  Variables scored; special features: Numerous interest categories. Clinical
  interpretation possible.

Inventory of Factors S-T-D-C-R (Sheridan)
  Age level [a]: Adolescent and adult
  Number of items: 175
  Variables scored; special features: Five aspects of introversion. Based on
  factor analysis. Two other inventories in this series.
  Comments of reviewers [b]: “A distinct advance in test construction but
  [lacking] in validation data, . . . a test for the experimenter rather than
  for the practical tester in clinic or factory” (Eysenck, 8).

McFarland-Seitz Psychosomatic Inventory (Psych. Corp.)
  Age level [a]: Adolescents and adults
  Number of items: 92
  Variables scored; special features: Neurotic tendency; stresses physiological
  symptoms.
  Comments of reviewers [b]: “Unusually high validity. . . . Possible distorting
  factors” (Mosier, 7, p. 77).

Maller Character Sketches (Teachers Coll., Columbia)
  Age level [a]: Grade 5-college
  Number of items: 200
  Variables scored; special features: Six aspects of adjustment. Indirect
  questions.

Minnesota Multiphasic Personality Inventory (Psych. Corp.)
  Age level [a]: Adult
  Number of items: 495
  Variables scored; special features: Nine clinical categories. Empirical
  construction. Scales to detect invalid responses.
  Comments of reviewers [b]: “As a general screening test, very valuable in
  psychiatric clinics. . . . No great reliance may be placed in the belief
  that the subscales measure what their titles suggest” (Rotter, 8). [c]

Pintner Aspects of Personality (World Book)
  Age level [a]: Grades 4-9
  Number of items: 105
  Variables scored; special features: Ascendance, introversion, emotionality.
  Worded for children.
  Comments of reviewers [b]: “No reason for believing teachers can safely depend
  on an individual's score. . . . As a clinical tool, very satisfactory”
  (Louttit, 7, p. 56).

Rogers Test of Personality Adjustment (Association Press)
  Age level [a]: Age 9-13. Boys; girls
  Number of items: Six sections
  Variables scored; special features: Inferiority, social adjustment, family
  adjustment, daydreaming. Stresses clinical analysis.
  Comments of reviewers [b]: “In our clinic the most satisfactory method of
  personality measurement” (Louttit, 7, p. 94).


[a] Semicolon indicates separate forms of test.
[b] The reader should refer to the source cited for a complete statement of the reviewer's opinion.
[c] Body of text carries more complete reports on validity of these tests.

23. Consider the method of item selection and keying for the tests discussed in the
text.

a. Which tests use items selected by an external criterion of the trait measured?

b. Which tests use items selected by an internal-consistency test only?

c. Which tests use items selected to fit the definition of the trait, without direct
empirical checking?

Comparison of Tests 

The tests described above differ in at least the following respects:

Purpose: clinical description (MMPI, Rogers) or screening (Bell, Shipley
PI).

Theory of personality: explicitly assumed types (Humm-Wadsworth,
Allport-Vernon), defined traits (Bernreuter), personality as an
integrated whole (Rogers, Interest Index), or no particular theory
(Jurgensen).

Length: 20 items (PI, short form) to 500 items (MMPI).

Control over response sets, honesty: control scores (MMPI,
Humm-Wadsworth), forced choice (Allport-Vernon, Jurgensen), study of
responses item by item (Rogers), or no control (Bell, Bernreuter).

RELIABILITY AND VALIDITY 
The Stability of Personality 

All personality tests assume that habits are stable, at least over significantly
long periods. Retest correlations with personality tests are generally high for
all traits studied, even with long intervals between the tests. These
correlations are, however, smaller than those obtained from immediate retests or
parallel tests administered consecutively, which shows that some change
occurs. Maller states that some personalities are organized more stably than
others, so that it is possible to obtain stable measures for some persons but
not for all (28).

Description of Personality 

The most ambitious purpose of personality tests has been to describe
individuals. Tests which give only one score have been intended for screening
and prediction, but other tests yield several scores describing personality, or
even complete portraits of cases. To study validity of descriptions, possible
criteria are other self-report tests, data from normal behavior and
standardized observations, or clinical reports based on detailed case studies.

Comparison of different self-report indices can yield only negative
evidence of validity, since two tests might agree, yet both be invalid. There is
general consistency among self-report tests of the same function, as would
be expected from their duplicating content. This correspondence is not
perfect. Correlations between four tests supposed to measure introversion are
shown in Table 62. These tests do not measure the same thing. The names
given to traits by test authors cannot be trusted; what a test measures can be
understood only in terms of detailed study of the items or an outside
criterion.
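
One way to check that the modest intercorrelations are not merely a product of
unreliable scores is the classical correction for attenuation, which divides an
observed correlation by the square root of the product of the two
reliabilities. The text does not apply this correction; the sketch below, with
hypothetical function names, applies it to the figures of Table 62 purely as an
illustration, and it assumes that the split-half coefficients reported there
are suitable reliability estimates.

```python
from math import sqrt

# Split-half reliabilities and intercorrelations as reported in Table 62.
RELIABILITY = {"Bernreuter": .80, "Laird": .55, "Marston": .84, "Northwestern": .62}
OBSERVED = {
    ("Bernreuter", "Laird"): .47,
    ("Bernreuter", "Marston"): .37,
    ("Bernreuter", "Northwestern"): -.09,
    ("Laird", "Marston"): .30,
    ("Laird", "Northwestern"): .10,
    ("Marston", "Northwestern"): .25,
}


def disattenuate(r_xy, r_xx, r_yy):
    """Estimate the correlation between two scores freed of measurement error."""
    return r_xy / sqrt(r_xx * r_yy)


def corrected_table():
    """Apply the correction to every pair of tests in the table."""
    return {
        pair: disattenuate(r, RELIABILITY[pair[0]], RELIABILITY[pair[1]])
        for pair, r in OBSERVED.items()
    }
```

Under these assumptions even the largest corrected value, that for the
Bernreuter and Laird scores, stays near .7, and all the others fall below .5.
This is consistent with the conclusion that the four scores do not measure the
same thing.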


Table 62. Intercorrelations of Four “Introversion” Scores for
172 College Students (15)


Name of Test    Bernreuter    Laird    Marston    Northwestern
Bernreuter        (.80)        .47       .37         -.09
Laird              .47        (.55)      .30          .10
Marston            .37         .30      (.84)         .25
Northwestern      -.09         .10       .25         (.62)


Figures in parentheses, on the diagonal, are split-half reliability coefficients.


Attempts to describe clinical patterns of personality have been most often 
based on the MMPI. The authors of that test have found marked differences 
between patients diagnosed by psychiatrists in various categories. Meehl
made blind analyses of test records of independently diagnosed patients. The
test score permitted him to classify those with abnormal patterns correctly as
psychotic, neurotic, or conduct disorders in about 60 percent of the cases
(30). Less favorable results were obtained by Benton and Probst, who made a
study of 70 patients. Psychiatrists rated the degree to which each patient, by
his behavior, showed each deviate trend reported by the MMPI. This is a
direct test of the MMPI as a descriptive tool. They found a significant degree
of agreement between test and rating on Pd, Pa, and Sc, but small or zero
agreement on Hs, D, Hy, Mf, and Pt (6). Further study of the test will be
needed before descriptions made by it can be depended upon. The recent K
correction should increase the accuracy of the test.

For normal subjects, clinical criteria are rarely available. Ratings of be- 
havior by acquaintances are frequently used as a criterion, and usually show 
small positive correlations between test scores and opinions of judges. One 
of the highest sets of validity coefficients from ratings is reported for the Guil- 
ford inventory. Scores of 50 adult students of psychology were correlated 
with ratings made by close acquaintances. Correlations for five aspects of in- 
troversion (see p. 317) were .70 for factor S, .60 for D, .48 for R, .20 for C, 
and .08 for T (18). The low correlations for C and T are partly due to in- 
valid ratings. 

In a few studies, descriptions from personality tests of normals have been
compared with extensive case material from other sources. Sheviakov used
this method to validate the clinical interpretation of the Interest Index for
Lyle. The results showed that the descriptions were far better than a blind
guess, but not adequate as a basis for action without further verification (38).
To validate the Humm-Wadsworth profile, subjects were rated on each
component of personality by the test and also from a case history. Almost
incredible validities, from .83 to .92, were reported (24). It is unfortunate
that published confirming studies under independent investigators are not
available as a basis for judging the test. It is to be noted that validities are
much lower, according to the published data, for cases with extreme response
biases, such as a tendency to say “no” in answering questions. When these
cases are identified by a control score and dropped out, the validity
coefficient rises, e.g., from .85 to .94 (22). No data have been published
indicating the validity of extreme scores after correction for response sets, a
procedure only recently introduced.

Screening of Maladjusted Cases 

In industry, the armed services, and some school and clinical situations,
self-report tests might be helpful in screening. Since screening is a “pass or
fail” operation, validity is best judged on the false positive and pickup rates.



Fig. 75. Pickup and false positive rates of PI at various score levels, for
normals (N=458) and dischargees (N=30) (after 37).

A test identifying probable deviates must not screen out too many normals
(false positives) since the burden of examining further so many men is
troublesome. At the same time, the test must pick up nearly all the deviates.
For practical work, as distinguished from research, it is necessary to determine
the efficiency of selection rather than whether a significant positive relation
between test and criterion is present. Evaluation must consider the number
of false positives, rather than the percentage of such errors. In large
populations, a small rate of error may force the psychiatrist to study an
unnecessarily great number of cases (39).

Hunt and Stevenson discuss at length the military problem of screening
and the importance of adjusting the critical score in the light of practical
conditions. They conclude that self-report tests have proved much more
successful in military screening than in similar civilian situations (25). The
following studies are typical of the results found.

Figure 75 shows results with the short form of the Personal Inventory ap- 
plied to recruits. These data show how test results correspond with the judg- 
ments of psychiatrists who examined all men. More unfit men than normals 
make high (undesirable) scores on the test. Out of 30 dischargees, 12 made a 
score above 9, but only 13 normal men out of 458 were falsely screened by 
this cutting score. This cutting score, however, would miss 18 dischargees. A 
deeper cut, screening men at 5 and above, catches 21 of the unfit men, but 
the price of this better identification is that 108 false positives must also be 
individually examined. Such a screening device will always miss a few men, 
but it is clearly more effective than no device at all. The cutting score used 
will depend on the importance of eliminating nearly every unfit man at this 
stage of processing, and on how many men the psychiatrists have time to 
interview.
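
The arithmetic behind the two cutting scores just discussed can be checked
with a short sketch. The counts are those quoted in the text for Figure 75.
The function names are illustrative only, and the sketch assumes that "pickup
rate" means the proportion of dischargees screened and "false positive rate"
the proportion of normals screened.

```python
# Counts quoted in the text for the Personal Inventory (Fig. 75).
N_NORMALS = 458
N_DISCHARGEES = 30

# For each cutting score: (dischargees screened, normals screened).
# The labels simply repeat the wording of the text.
CUTS = {
    "above 9": (12, 13),
    "5 and above": (21, 108),
}


def pickup_rate(hits, n_deviates):
    """Proportion of true deviates that the cutting score catches."""
    return hits / n_deviates


def false_positive_rate(false_positives, n_normals):
    """Proportion of normals that the cutting score screens out."""
    return false_positives / n_normals


def summarize(hits, false_positives):
    """Return pickup rate, false positive rate, misses, and interview load."""
    return {
        "pickup": pickup_rate(hits, N_DISCHARGEES),
        "false_positive": false_positive_rate(false_positives, N_NORMALS),
        "missed": N_DISCHARGEES - hits,
        "to_interview": hits + false_positives,
    }


for label, (hits, fps) in CUTS.items():
    s = summarize(hits, fps)
    print(f"{label}: pickup {s['pickup']:.0%}, "
          f"false positives {s['false_positive']:.1%}, "
          f"missed {s['missed']}, to interview {s['to_interview']}")
```

On these counts the deeper cut raises the pickup from 40 to 70 percent of the
dischargees, while the number of men to be interviewed rises from 25 to 129.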

24. What percentage of dischargees have scores of 12 or over? Of 3 or over? What 
percentage of non-dischargees have these scores? 

25. In selecting men for responsible duty, as in submarines, what cutting score 
would be best, assuming that the test measures a factor important on the job? 

A study of marine recruits used an inventory of 52 items, constructed to 
meet the needs of the specific situation. The inventory, together with one of 
the Kent “emergency” mental tests, was used to provide information to the 
psychiatrist interviewing the men. With this plan, it was possible to process 
400 men in 2½ hours with only 2 psychiatrists. Although the investigators feel
that their criterion was somewhat contaminated by factors not experimen- 
tally controlled, the inventory did identify potentially unsuccessful men. The 
test was given to about 3500 men. About 600 out of 900 who failed training 
(66 percent pickup) had scores of 9 or over; such a cutting score would yield 
about 250 false positives (10 percent) (32). 

An adaptation of two short questionnaires given 2081 Seabees successfully
identified 281 cases later adjudged as presenting neuropsychiatric conditions
by psychiatrists, missed 16 who came to the psychiatrists’ attention through
difficulty on duty, and falsely screened 244 men judged normal upon further
study (20). This means that the tests permitted psychiatrists to omit
individual interviews with 1540 men, not a trifling saving.
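
The figure of 1540 can be recovered from the other counts. The sketch below
shows one reading of the arithmetic; the source does not spell the calculation
out, so the breakdown and the variable names are assumptions made for
illustration.

```python
# Counts quoted in the text for the Seabee screening study (20).
TOTAL_TESTED = 2081
HITS = 281             # screened, later judged neuropsychiatric
MISSES = 16            # passed by the tests, later came to attention
FALSE_POSITIVES = 244  # screened, judged normal on further study

referred = HITS + FALSE_POSITIVES      # men sent on for interview
passed = TOTAL_TESTED - referred       # men the tests let through
never_interviewed = passed - MISSES    # passed and never needed study
pickup = HITS / (HITS + MISSES)        # share of identified cases caught

print(referred, passed, never_interviewed, round(pickup, 3))
```

Read this way, the 1540 men are those who were neither screened by the
questionnaires nor brought to the psychiatrists later through difficulty on duty.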

Attempts to screen students to find those requiring counseling have been 
more disappointing. Darley, for example, tested over 800 students at the Uni- 
versity of Minnesota. Each student was interviewed repeatedly during the 
year by a counselor, after which a diagnosis of the kind and extent of mal- 
adjustment was made. The Bell Adjustment Inventory had identified cor- 
rectly 40 students having problems relating to home adjustment, but missed 
41, and produced 73 false positives. On emotional adjustment, there were 32 
hits, 75 misses, and 42 false positives (11). In a study of the Bernreuter test
at the University of Iowa, an exceptionally good criterion was used:
observation records on 81 girls gathered continually during the year. Of 16
girls at the maladjusted extreme on the Neurotic Tendency scale, only six were
considered actually maladjusted, whereas two of those least maladjusted,
according to the test, were rated maladjusted on the criterion. The
Self-Confidence scale was more successful. Ratings agreed with test scores for
10 of 10 girls showing extreme lack of confidence on the test, and for 6 of 8
whose test scores showed high confidence (13).
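
Darley's counts for the Bell Adjustment Inventory can be turned into two
proportions: the share of maladjusted students the test found, and the share
of flagged students who were in fact maladjusted. This is a minimal sketch
using only the counts quoted above; the function name and the two summary
measures are chosen here for illustration and are not taken from the study.

```python
# Counts quoted in the text from Darley's study of the Bell Adjustment
# Inventory (11): (hits, misses, false positives) for each scale.
BELL_RESULTS = {
    "home adjustment": (40, 41, 73),
    "emotional adjustment": (32, 75, 42),
}


def screening_summary(hits, misses, false_positives):
    """Return (pickup rate, share of flagged students who were true cases)."""
    pickup = hits / (hits + misses)
    flagged_correct = hits / (hits + false_positives)
    return pickup, flagged_correct


for scale, counts in BELL_RESULTS.items():
    pickup, flagged_correct = screening_summary(*counts)
    print(f"{scale}: pickup {pickup:.0%}, "
          f"flagged correctly {flagged_correct:.0%}")
```

On these counts the inventory found roughly half of the home-adjustment cases
and under a third of the emotional-adjustment cases, and on both scales most of
the students it flagged were not judged maladjusted by the counselors.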

Although errors are too frequent to warrant trust in these devices as
screening tools, scores have a validity better than chance. Many studies have
found relations such as those reported by Stogdill and Thomas. They found
that the mean scores of students reporting voluntarily for counseling were
significantly deviant on both Flanagan keys for the Bernreuter. The
correlation of Self-Confidence scores for men with ratings on degree of
maladjustment was .59. Overlap in scores between normal and counseling groups
was too great for screening validity, and correlations with ratings were of
negligible size in a group of probationary students referred for assistance (40).
The test is more likely to be valid in groups seeking assistance than in groups
who are uncooperative. Despite its lack of precision in individual
measurement, the Bernreuter test describes tendencies in groups sufficiently
well so that several writers have endorsed it for research use (41). An invalid
test, however, may conceal relationships which it would be important to discover.

Identifying problem cases by self-report methods at earlier ages has 
proved even more difficult. Investigators who compare groups of known de- 
linquents with normals find some differences in scores, but the scores of 
the groups overlap so much as to discourage use of the tests for screening 
(5, 9). Wittman and Huffman, for example, found about 20 percent of de- 
linquent boys show poor emotional adjustment on the Bell test, but that 15 
percent of non-delinquent boys do equally poorly (44).

Studies of the value of personality tests for screening of undesirable em- 
ployees are at present inadequate for a final judgment of the method. 

Two studies offer promise in the application of tests to abnormal patients.
Hathaway applied the Bell, Bernreuter, and Humm-Wadsworth tests to nine
cases of “constitutional psychopathic inferiority.” The Bell inventory gave no
useful indications, but scores on the other tests showed suspiciously
“supernormal” patterns for these cases. Such psychopaths were expected to be
uninhibited and lacking in feeling of inferiority. On Bernreuter scale N they
scored in the highest 10 percent, some going “off the scale” in the direction
usually considered good adjustment. On the Humm-Wadsworth the group
was average or better on every component. Hathaway considered this
positive evidence in favor of test scores, clinically interpreted (21). Neymann
and Kohlstedt prepared a yes-no questionnaire and applied it to diagnosed
schizophrenics and manic-depressives. The distribution was bimodal, with
little overlap of the two groups. Schizophrenics ranged from -1 to -50,
while manics ranged from -6 to +50. Ninety-three percent of the patients
were correctly classified (34). A repetition of the study produced similar
results, shown in Figure 76 (35). The difference between the psychotic groups
is significant and suggests that test scores may aid in classifying psychotics,
but not necessarily in screening them from unselected groups.

Fig. 76. Distribution of normals, manic-depressives, and schizophrenics
on the Neymann-Kohlstedt test (35). Curves represent scores of 35
manic-depressives, 35 schizophrenics, 44 clinic patients showing no mental
abnormality, and 44 medical students. A positive score indicates “extrovert”
responses.

Prediction of Success 

Self-report personality tests have been used to predict such performances
as school grades, success in industry, and success in selling. Attempts to
predict school grades by self-report tests have been completely unsuccessful. An
occasional study finds a correlation as high as .40, but the consensus is that
a particular personality trait may be helpful or harmful in attaining good
marks, depending on the way that trait is organized into the total personality.

Prediction of pilot success by personality inventories was not generally
successful. Nearly every published test was tried on at least a small group,
and the correlations with graduation were small. Of all tests, the highest
correlations were obtained for the Humm-Wadsworth scores Hysteroid and
Epileptoid, corrected for response sets. These r’s were only .19 and .22, but
might add to a multiple correlation (17, p. 583). Such correlations do not
consider the significant patterns of scores which form the usual basis for
interpreting this and several other tests.

In business, there have been a few striking successes with the method. One
practical tool is the Aptitude Index, which was designed for use in life
insurance agencies.^ It is a combination of personal history and personality
questionnaire, scored by an empirical formula developed by trial on agents.
The correlation with sales volume is .40. Although this is low for individual
prediction, it leads to considerable improvement of average sales per man
in the long run. Eliminating from consideration those men who dropped from
the job in their first year, it was found that men rated A produced 206 percent
as much business as the average man, while men rated E produced only 41
percent of average (27).

The experience of Wonderlic and Hovland with Household Finance
Corporation is consistent with this. They report: “Our early investigations were
carried on with published tests which have been standardized by others. . . .
We were unable to find any test in which the total score was significantly
prognostic of success in our organization to warrant its inclusion as part of a
selection program. In the cases of many of the purchasable personality tests,
results were obtained which ran counter to expectations. Clerical workers
seemed to be more aggressive than salesmen, salesclerks were higher than
managers” (33, p. 60).

They describe a procedure of selecting a new set of items which
distinguished between good and poor employees. This new scale, they report, is
definitely helpful in selection. Validation figures are not available. They point 
out that whereas measuring the most favorable personality pattern for their 
particular work has been successful, it would be necessary for any other 
organization to develop its own test by empirical studies in its own plant. 
This is also the argument behind Jurgensen’s test. 

In contrast to this, some attempts have been made to suggest to
counselors the personality patterns appropriate for given vocations. Thus Harmon
and Wiener suggest the use of the MMPI in veterans’ counseling (19). High
Hs, D, and Hy, they suggest, may show unwillingness to take dirty or heavy
jobs, and desirability of avoiding stress. High Pa and Sc indicate the
desirability of relatively routine work. Such recommendations are based on the
assumption (often incorrect) that the person who has a high Paranoid score
has paranoid behavior, and so on. Thinking of this type is most helpful when
tests are used as Moore suggests (33, p. 32): “They provide leads for the
interview, so that the astute examiner may use them as starting points to
disclose attitudes, values, feelings, and meaningful experiences. They open the
possibility of discussing issues to which the candidate has given some
consideration, which may lead to the disclosure of significant experiences and
attitudes that are destined to affect activity in future work. Their most
reliable value is in their aid in spotting those with extreme tendencies. . . .
However, to infer that the presence of these tendencies is a guarantee of
adequacy or trouble is to make an assumption that has not been supported by
any type of follow-up study.”

^ Publisher of the Aptitude Index: Life Insurance Agency Management
Association.

Several studies of marital happiness have used personality inventories. 
Adams, for example, tested students before marriage, and measured their 
marital happiness by a questionnaire after they had been married six months 
or longer. For 100 couples, the Terman Prediction Scale, which is a
self-report device, correlates .32 (men) and .38 (women) with happiness (1).
Personality traits measured by the Personal Audit had no predictive value.

26. How might lack of self-confidence help one student to attain high marks, yet 
be a drawback to another? 

27. How adequate is a validity coefficient of .30-.40 for advising a particular 
couple whether to marry? 

Conclusion 

There are few hotter arguments in testing today than whether self-report
personality measures should be trusted. It is clear that the hopes originally
held for them have not been fulfilled, but they have not proved worthless
in all situations. The range of opinions currently held is represented by these
two quotations. Patterson, who made a careful study of the literature on the
Bernreuter test, which of all tests has been most widely studied, insists (36):
“It must be concluded that the results of studies using the Bernreuter
Personality Inventory are almost unanimously negative as far as the finding of
significant relationships with other variables is concerned. . . . It is no doubt
largely due to the nature of the questionnaire approach, which appears to be
a fruitless technique for the study of personality.”

While test manuals and advertisements still make extreme claims for some
self-report tests, such claims are rejected by most psychologists. Baker,
discussing the diagnosis of children, represents the cautious position taken by
those psychologists who, in good faith, uphold use of the tests (4, pp.
379-380): “It is generally known that children’s problems are so close to their
lives that they can scarcely refrain from answering what applies to them. This
situation is similar to the quite universal tendency of most individuals to
unburden themselves about their problems even to strangers if they are
encouraged to talk about themselves. As such tests are becoming more practical
and cover more areas, their uses are certain to increase when economy of
time and usefulness become more apparent. It should be understood that
they are not an end in themselves but should be the basis for further
interview by trained workers.”*

* From Baker: Introduction to exceptional children. Copyright 1944 by The
Macmillan Company and used with their permission.
Much of the difficulty in personality testing has no doubt arisen from
inadequate preparation of instruments and inadequate theories of personality.
Certainly there is a wide range of value from the best self-report tests to the
poorest. With more careful research and critical application of tests, this
technique should be increasingly helpful, provided the following principles
are kept in mind:

1. A “poor” score on a personality inventory probably indicates a person
who should have further attention; a “good” score does not guarantee the
presence of “good” qualities.

2. A test may be assumed to be valid with a particular group only when
validity has been proved by experimentation on such subjects. Tests
standardized on college students are unlikely to work without change for
employees, patients, or soldiers. Empirically developed scoring is far superior
to scoring keys constructed in an armchair.

3. A self-report test can never be used as a final basis for any decision in
counseling or disposing of an individual. It performs its most useful function
in suggesting to the psychologist possible facts about the individual to be
confirmed by further study of him.

4. Tests having control scores to identify unusual response sets, or tests
using a forced-choice design, are generally superior.

5. Results are far more trustworthy if the subject desires to give a true
picture of himself.

28. Among the studies cited above reporting positive validity, which were based 
on tests designed for the particular sort of prediction being made? Which were 
based on tests designed for use in many different situations?

29. How would self-report tests be useful in a college guidance program? 

30. What factors made for the unusual usefulness of self-report tests in military 
processing? 
CHAPTER 15


Self-Report Techniques: Interests

NEED FOR INTEREST TESTING 

An interest may be defined as a tendency to seek out an activity or object, or 
a tendency to choose it rather than some alternative. Knowing the interests 
of an employee helps us predict whether he would be satisfied if transferred 
to a new type of work; knowing the interests of a juvenile delinquent suggests 
activities which might keep him satisfied at home; and knowing the interests 
of a pupil helps the teacher to suggest projects, topics for writing, and social 
experiences which he will pursue more eagerly than others. 

Interests can rarely be determined adequately by a direct question. There
would be no need for tests if one could obtain from a subject a valid list of
his major preferences. The question “Would you like to be a physician?” does
not help greatly to determine whether a boy should consider medicine as a
career. Apart from the usual problems of obtaining honest self-report, an- 
swers are often misleading. Frequently the response is to the symbol rather 
than the activity; medicine is respected, and the boy may feel that he should 
like anything respectable. Answers are often based on ignorance or super- 
ficial understanding of the activity. Girls may reject teaching for no better 
reason than that they think correcting papers would be tedious, little realiz- 
ing the numerous other activities in a school day. Some boys choose law 
because it calls for public speaking, ignoring the long hours of isolated re- 
search and thinking called for. Counselors encounter numerous students and 
employees who are vocationally misplaced even though they had thought 
they would be interested in the work and enjoyed much of their training 
for it. Having made a choice, one often tends to defend that choice as correct; 
thus many people continue in activities, not facing the fact that they wish 
they had chosen differently. Under such circumstances, a simple question
will not reveal the true interest pattern. Therefore, interest tests seek
reactions to a mixed list of activities, some apparently remote from the interest
under discussion. By such indirect approaches insight into true interests
is obtained.


EMPIRICAL CONSTRUCTION OF INTEREST TESTS 
Assumptions and Procedures 

One of the two most widely used tests of interests is the Strong Vocational
Interest Blank, first published in 1927. It is a superior example of the
empirically constructed interest test.^ Strong’s empirical test may be contrasted
to Kuder’s Preference Record discussed below, which describes directly
what the subject likes.

Strong’s test makes use of the fact that people in a particular occupation
have roughly similar interests. It is then assumed that a person having the
same pattern will find satisfaction in that field, but that one having
dissimilar interests will not be happy in it. For example, Strong identifies the
interests which are characteristic of practicing engineers. In giving the test
to college students, those who have interests similar to adult engineers are
advised to consider engineering as a vocation, and those who have low scores
in “engineering interests” are advised that they may not enjoy the work of
the engineer.

The second assumption necessary to the Strong blank is that interests are
fairly constant. The test is ordinarily given to late adolescents and young
adults, and an attempt is made to predict what occupations the subjects
will find satisfying as a lifetime activity.

The test was constructed by listing questions on hundreds of activities
both vocational and avocational. Most items require a “like-indifferent-dislike”
report regarding activities or topics: biology, fishing, being an aviator,
planning a sales campaign, etc. The list of questions was given to successful
members of a particular profession, and the interest pattern for that
profession was determined by comparing the responses of the group with those
of unselected men of similar age. A weighted scoring key was then prepared
to give each person a score reporting how his interests correspond with those
of the professional group.


Table 63. Determination of Weights Used for Strong’s Engineer Key (21, p. 75)

                             Percentage of      Percentage of    Differences in        Scoring Weights
First 10 Items on            “Men-in-General”   Engineers        Percentage Between    for Engineering
Vocational Interest Blank    Tested             Tested           Engineers and         Interest
                                                                 Men-in-General
                              L    I    D        L    I    D       L     I     D        L    I    D

Actor (not movie)            21   32   47        9   31   60     -12    -1   +13       -1    0    1
Advertiser                   33   38   29       14   37   49     -19    -1   +20       -2    0    2
Architect                    37   40   23       58   32   10     +21    -8   -13        2   -1   -1
Army officer                 22   29   49       31   33   36      +9    +4   -13        1    0   -1
Artist                       24   40   36       28   39   33      +4    -1    -3        0    0    0
Astronomer                   26   44   30       38   44   18     +12     0   -12        1    0   -1
Athletic director            26   41   33       15   51   34     -11   +10    +1       -1    1    0
Auctioneer                    8   27   65        1   16   83      -7   -11   +18       -1   -1    2
Author of novel              32   38   30       22   44   34     -10    +6    +4       -1    1    0
Author of technical book     31   41   28       59   32    9     +28    -9   -19        3   -1   -2


^ An impression of the amount of effort involved in thorough and painstaking test
construction may be obtained from Strong’s report that over $45,000 was spent, and
blanks from over 23,000 people were obtained, in the course of his research (21, p. x).
Table 63 illustrates the plan by which the key was constructed. On each
item, the percentage of men-in-general giving each answer was compared
with the percentage of men in the occupation giving the answer. Engineers
dislike “Actor” more often than other men; therefore, response D was
assigned a positive weight in the engineer scale. The weighting is
proportional to the difference; liking to be author of a technical book is
especially significant of engineering interests. In contrast to the engineers,
40 percent of artists respond “like” to “Actor.” The weights of “Actor” in
the artist scale are 2 for L, 0 for I, and -1 for D (21, p. 73).
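
The weighting plan described above can be sketched in a few lines. The rule
used here, one scoring point for each 10 percentage points of difference,
rounded to the nearest whole number, is an assumption made for illustration.
It happens to agree with the weights listed in Table 63, but the text does not
give Strong's actual formula, and other keys use a wider range of weights. The
function names and the `score` helper are hypothetical.

```python
import math

# Percentages quoted in Table 63 for the item "Actor (not movie)",
# keyed by response: Like, Indifferent, Dislike.
MEN_IN_GENERAL = {"L": 21, "I": 32, "D": 47}
ENGINEERS = {"L": 9, "I": 31, "D": 60}


def approximate_weight(difference):
    """Illustrative rule only: one scoring point per 10 percentage points,
    rounded to the nearest whole number (halves away from zero)."""
    magnitude = math.floor(abs(difference) / 10 + 0.5)
    return int(math.copysign(magnitude, difference)) if magnitude else 0


def item_weights(occupation_pct, general_pct):
    """Weight each response by the occupation-minus-general difference."""
    return {
        response: approximate_weight(
            occupation_pct[response] - general_pct[response])
        for response in ("L", "I", "D")
    }


def score(responses, key):
    """Sum the weights of one person's responses; key maps item -> weights."""
    return sum(key[item][answer] for item, answer in responses.items())


actor_weights = item_weights(ENGINEERS, MEN_IN_GENERAL)
print(actor_weights)  # {'L': -1, 'I': 0, 'D': 1}

# A hypothetical respondent who dislikes "Actor" gains one point
# toward the engineer score on this item.
print(score({"Actor": "D"}, {"Actor": actor_weights}))  # 1
```

A full key would hold one such set of three weights for every item on the
blank, and a person's score on the occupation is the sum over all his answers.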

Strong developed keys for 39 occupations (printer, musician, etc.). A
blank for women was also devised, which can be scored for nurse,
stenographer, dentist, and 15 other occupations. Keys for masculinity-femininity,
occupational level (professional, skilled, unskilled, etc.), and maturity of
interests have also been prepared. The scores on each variable are converted
into grades A (70 percent of successful men in the occupation fall in this
group), B, and C.

1. Estimate the approximate weights for the chemist scale, on the item “Actor,” if
responses of chemists are as follows: 16 percent L, 34 I, 50 D.

2. Estimate the weights for the musician scale of “Actor,” if responses of musicians
are 34 percent L, 48 I, 18 D.

3. Suppose you wished to make a key for the Strong blank for women to measure
interest in being a mother, i.e., to predict whether a girl will enjoy raising a
family. Outline the steps you would follow to prepare the scale, with especial
attention to the persons you would use as a basis for the key.

4. Each of the following assumptions is implied in the construction or in some uses 
of the Strong test. For each one, state a contradictory hypothesis that might be 
reasonable. 

a. One is not likely to succeed in an occupation unless the work is interesting 
to him. 

b. One is not likely to succeed in an occupation unless his interests are similar
to those of most other men in the profession.

c. Success and interest in the school subjects required for preparation for a
profession are not an adequate basis for predicting satisfaction and success in
the profession.

d. The interests leading to success in a vocation in 1930 will also be the interests 
associated with success in 1950. 

A Typical Strong Scale 

The meaning of Strong test scores may be understood more thoroughly 
by examining a particular scale, say, “psychologist.” Strong prepared his 
key from responses of 192 members of the American Psychological Associa- 
tion. Since full membership in this association was then restricted to persons 
who had demonstrated ability in psychological research, these men were 
presumably successful psychologists. The usual key was prepared which 
assigned weights on the basis of the extent to which the psychologists differed
from unselected men. Among the weights are:

                                            L     I     D

Author of technical book                    4    -4    -4
College professor                           4    -3    -4
Employment manager                          1     0     0
Social worker                               0     1    -1
Undertaker                                 -4     0     1
Zoology                                     4    -3    -4
Musical comedy                              0     0     0
Interviewing prospects in selling          -3    -1     3
Opening conversation with a stranger       -1     1     0
Expressing judgments publicly without
  regard to criticism                      -1     0     1

5. At the time of Strong’s study, the American Psychological Association included 
most teachers and research workers, but did not include a large proportion of 
clinical psychologists and applied psychologists. 

a. What limitation does this place on the interpretation of scores on the Strong 
“psychologist” key (7, p. 11)? 

b. Would it be advisable to make a new key, using as a standardizing group 
both academic and applied psychologists? [Research on new keys for the 
psychological profession is in progress.] 

6. Psychologists are generally required to do considerable mathematical work in
their research, yet liking for mathematics is assigned a weight of zero in the
Strong scale. How can this seeming inconsistency be explained?

7. “An A rating in psychologist with B+ in physician and dentist should suggest a
different preparation and career than an A rating in psychologist with B+
ratings in engineer, production manager, and carpenter” (21, p. 54). What
differences in advice are justified in these cases?

The Global Interest Chart 

For guidance, the total picture of the subject’s interests must be obtained,
rather than scores merely in the one or two occupations he considers himself
interested in. Obtaining a complete profile on the Strong test is laborious
and expensive. Fortunately, factor analyses have shown grouping among
interests, so that one can score the student for broad groups and then
consider specific occupations in a few keys only. Clusters of interest found in
the Strong test for men are (21):

Group I, Creative-scientific: Artist, psychologist, architect, physician, dentist.

Group II, Technical: Mathematician, physicist, engineer, chemist.

Group III: Production manager.

Group IV, Sub-professional technical: Farmer, carpenter, printer,
mathematics-science teacher, policeman, forest service.

Group V, Uplift: YMCA physical director, personnel manager, YMCA secretary,
social science teacher, school superintendent, minister.

Group VI: Musician.

Group VII: Certified public accountant.

Group VIII, Business detail: Accountant, office man, purchasing agent, banker.

Group IX, Business contact: Sales manager, real-estate salesman, life insurance
salesman.

Group X, Verbal: Advertising man, lawyer, author-journalist.

Group XI: President of manufacturing corporation.



Fig. 77. Global Interest Chart for Strong Vocational Interest Blank for Men.
(Reproduced by permission of Stanford University Press.)
Strong has provided keys for groups I, II, V, VIII, IX, X.

Strong’s global chart provides a simple picture of the major patterns of
occupational interests. Figure 77 shows the chart, which is thought of as
two faces of a sphere. Strong advocates locating the data for a student to be
guided by circling first his primary interests (A scores), then secondary
interests (B+), and finally, the tertiary (B) scores. This permits a judgment
based on the total significant pattern and draws attention to possible
combinations of interests. Curves have been drawn on the chart to indicate the
ratings of a psychology major, tested in his senior year.

8. If the student represented in Figure 77 has done good academic work and has 
been satisfied in his courses in psychology, what possible vocational aims are 
suggested by the test? 

9. There are many groups in which this student has low scores. Which of these
lacks of interest would be significant in deciding against certain psychological
positions?


A DESCRIPTIVE INTEREST TEST 
The Search for Traits 

It is not adequate for measurement of interests merely to catalog the 
thousands of specific interests a person may have. Yet a description of the 
particular interests of the subject would be helpful in working with him. 
Several scales have therefore been devised which attempt to describe the
subjects in terms of general fields of interest. As in descriptive personality
tests, traits (such as “interest in clerical activities” or “interest in
sports”) are the basis for description.

Strong’s test is not a descriptive scale since scores do not tell just what the
subject likes, but only that his interests are like those of a group of men.
The most useful descriptive interest scale is the Kuder Preference Record.

Fig. 78. Kuder Preference Record profile of Mary Thomas.

Kuder’s test, like Strong’s, is primarily intended for vocational guidance. He
began by listing a large number of activities from work, school, and recreation.
An analysis of correlations was then made to determine which of these
interests were found together. The test includes seven independent clusters
or areas of interest. Two additional scales were added which partly
overlapped two of the original ones. The final profile describes the subject in
terms of his interests in mechanical, computational, scientific, persuasive,
artistic, literary, musical, social service, and clerical activities. The scores
are presented as a profile, such as the one illustrated in Figure 78.* Each
Kuder item has the form of a forced choice among three possible activities,
the person being required to select the one he likes least and the one he likes
most. An example is:

a. Develop new varieties of flowers. 

b. Conduct advertising campaign for florists. 

c. Take telephone orders in a florist shop. 

A person who chooses “a” as most liked receives credit under Scientific and
Artistic; choice of “b” is scored as Persuasive; and choice of “c” is counted
as Clerical.
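
The scoring rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the key below covers the single sample item quoted above, the function name and the three-item "test" are hypothetical, and credit is tallied for the most-liked choice only, since the text does not say how the least-liked choice is scored.

```python
from collections import Counter

# Key for the sample item only, as described in the text: the option
# marked "most liked" earns one credit on each listed scale.
SAMPLE_ITEM_KEY = {
    "a": ("Scientific", "Artistic"),  # Develop new varieties of flowers
    "b": ("Persuasive",),             # Conduct advertising campaign for florists
    "c": ("Clerical",),               # Take telephone orders in a florist shop
}


def score_most_liked(responses, keys):
    """Tally scale credits for the most-liked option chosen on each item.

    responses: list of option letters, one per item.
    keys: list of dicts (one per item) mapping option letter -> scale names.
    """
    profile = Counter()
    for choice, key in zip(responses, keys):
        profile.update(key[choice])  # one credit per keyed scale
    return profile


# Hypothetical three-item blank that simply reuses the sample item.
profile = score_most_liked(["a", "c", "a"], [SAMPLE_ITEM_KEY] * 3)
print(dict(profile))
```

Because every response adds credit to some scale, a high score in one area necessarily limits the credit available to others, which is one reason the resulting profile is read as a pattern rather than scale by scale.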


Interpretation of Scores 

Kuder profiles are interpreted on their face value. A person who shows
high interest in Clerical and Computational (as in the profile of Mary 
Thomas) would be expected to enjoy positions demanding such activities. 
The low-interest areas are also important, since the person might dislike 
work demanding considerable activity of this type. In Mary Thomas’ case, 
the interest test was very useful. She was majoring in child development in 
college at the time she took the test. Her grades were mediocre, and her 
work with children neither especially good nor unsatisfactory. When questioned
regarding her choice of major, she explained that she had set her
heart on work in an orphanage. This desire had arisen in childhood when she
read a book about a woman who helped orphan children, and this had 
seemed to her a “wonderful” thing to do as a lifework. The low Kuder scores 
in Persuasive and Social Service activities suggested a tendency to introver- 
sion; the high Mechanical, Computational, and Clerical scores suggested a 
liking for routine rather than creative activities. When questioned about of- 
fice work, she enthusiastically described her enjoyment of previous summer 
work as a file clerk; her duties apparently had consisted solely of alphabetiz- 
ing folders, yet she had “just loved it.” Moreover, she had done well in secre- 
tarial training courses. Evidently both ability and interest fell in the same 
area, one she had not considered as a vocational goal. 

Data have been gathered to show which interests are prominent among
persons successful in a given occupation. From the average scores for an
occupation listed in Kuder’s manual, the student can determine, as in the
Strong test, whether he has the normal interests of those successful in a
chosen occupation. Kuder is now releasing formulas for computing, from the
nine descriptive interest scores, occupational interest scores similar to
Strong’s. These formulas are based on differences between men (or women)
in an occupational group and men (or women) in general.

* The Profile Sheet for the Kuder Preference Record for Women, Science Research
Associates. (Adapted by permission.)

10. What tentative conclusions can be drawn about a college man with the
following percentile scores: Mechanical, 50; Computational, 30; Scientific, 70;
Persuasive, 98; Artistic, 70; Literary, 90; Musical, 50; Social Service, 40;
Clerical, 15?

11. A boy who is majoring in business administration shows high interests in
Persuasive and Social Service. He is near average in Clerical and Computa- 
tional. He has a high score (78th percentile) in Scientific, which is not usual 
among business managers. What suggestions do these findings justify? 

Comparison of the Strong and Kuder Tests 

The Strong and Kuder tests, both widely used, illustrate the practical
problem of choosing between tests having similar purposes. The practical
differences between the tests are small except for cost of scoring. The tests
require about the same time and both are easily administered. The test
materials are comparable in cost. Scoring the Strong test on all possible
occupations requires much clerical work. The charge of 70¢ per person made by
one agency for complete scoring service indicates the order of magnitude of
the cost. The Kuder test may be scored by the subject himself in a few
minutes, so that scoring cost is negligible. Several investigators have urged a
simple, unweighted scoring for the Strong blank, and offered evidence for
its adequacy, but Strong considers these proposals unsound (14, 23).

Which test gives the most useful information depends upon one’s purpose.
The Kuder focuses attention on the nature of the activities enjoyed, whereas
the Strong focuses on the persons to whom one is similar. The logical validity
of the Kuder is accordingly high, since a high mechanical interest score 
indicates self-reported liking for the activities listed in the Mechanical scale. 
The case for the Strong test rests on empirical validation, since a high score 
on the Carpenter scale does not mean one would like all the activities of the 
carpenter. 

How students rate the tests is indicated by reports of 50 high-school 
seniors (boys) who took seven interest inventories. The Kuder test was over- 
whelmingly judged easiest to comprehend, easiest to mark, and mechanically 
attractive. On the question, “In which did you find the items of greatest
interest?” 17 said Strong and 9 chose Kuder; the Gentry and Cleeton tests
were chosen by 11 and 10 boys respectively. The crucial question was
“Which yielded results which were most satisfying to you?” There was little
uniformity of choice; Strong received 15 choices, Gentry 13, and Kuder 10
(13).

The empirical test is somewhat more indirect and subtle. It is probably
easier to fake high interest in an area on a descriptive test than to guess how
responses are scored on the empirical test. This gives the empirical test an
advantage when self-report is likely to be false (18).

The tests apply to the same age range. The Strong blank for women uses
different questions from the blank for men, whereas the Kuder uses the same
questions but bases profiles on separate norms for each sex. Because the
Strong test for women is less complete than that for men, Darley advocates
the use of the men’s blank for prospective “career women” in college (7, p.
13).

Most interests measured by both tests have correlations near .60 (25). 
The correlations are in themselves some evidence that the tests are valid, 
since tests made by very different techniques give similar pictures of the 
individual. At the same time, there is evidence that scores with similar names 
are not always interchangeable. Whereas Kuder Scientific scores showed 
marked agreement with Strong Chemist, Engineer, and Dentist, there was 
almost no correspondence of Kuder Artistic and Musical with Strong Artist
and Musician, respectively (26). Interest tests can be validly interpreted only
by a person who has studied each blank carefully to learn its idiosyncrasies.

12. Compare the extent to which the two tests are subject to response sets.

13. Which test would be most useful for counseling a group of high-school
students with both professional and non-professional interests?

14. Which test would give you the most assistance if you were seeking vocational 
guidance (now or at an earlier age)? 

15. It is desired to use an interest test for helping college freshmen make voca- 
tional plans. Among the training programs offered are veterinary medicine, 
dietitian, and others not directly represented in the Strong and Kuder keys. 
How could each test be extended to assist students in judging whether they 
would like these fields? Which test seems to be more adaptable? 

Other Interest Measures 

In addition to vocational guidance, interest tests may be used for other
purposes. Interest items commonly play a part in personality and adjustment
inventories. Several of the tests described in the previous chapter are
essentially measures of interest, notably the Allport-Vernon test and the
Interest Index. They are used, however, to obtain information about basic
personal characteristics. Pressey, in his Interest-Attitude Test, determined
which items were checked by subjects in varying age groups and, by scoring
the items marked by older subjects, obtained scores which represent
“interest maturity.”

16. How might an interest test be used to distinguish, among prospective teachers,
those likely to be traditional subject-matter teachers from those likely to
emphasize the development of the pupil as a person? Outline a plan for
research to develop such a procedure (5). Why might such a test be more
successful than a questionnaire on educational beliefs?

17. How could a psychologist design empirically a key for the Kuder or Strong
test to measure the “emotional adjustment” of subjects?

Several investigators have sought to avoid the errors of self-report by 
measuring what the subject knows about each area, assuming that one ac- 
quires more facts about topics which interest him. The validity of such 
measures has generally been unsatisfactory, but in a few cases promising 
results were obtained. The Cooperative Test Service suggests using scores 
on its Contemporary Affairs tests as interest indicators. Knowledge of cur- 
rent developments in politics, social-economic, science-medicine, literature, 
fine arts, and amusements was found to correspond to Strong test scores, 
college majors, and other interest indicators (12). AAF experience with a
similar technique is reported on p. 351. 

PREDICTIVE VALIDITY OF INTEREST SCORES 
Stability 

When the reliability of interest scores is studied by immediate retests, 
coefficients are high, as would be expected from the large number of items 
used. A more important question is the stability of scores, since, when we 
measure the interests of a student as a method of vocational guidance, we 
assume that he will have similar interests when he becomes a worker. Scores
on the Strong test are fairly stable. Strong gave his test to Stanford seniors,
and again 10 years later. Correlations between scores were sizable: .77 for
Physician, .51 for Personnel Manager, .68 for Sales Manager, .79 for Lawyer,
etc. (21, p. 361). The interest profiles for individuals were frequently very
stable; correlations between profiles ranged from .25 to .96 (median .75).
Only about 3 percent of all A ratings changed to C, and vice versa. For
high-school students, correlations between scores in tenth and twelfth
grades were somewhat lower, the average correlation being .57 (3). Although
scores are fairly stable, interests depend in part on maturity. Darley
feels that if allowance is made for the fact that interests are still changing,
the Strong test may be used effectively by age 15 (7, p. 28). Strong states
that scores are dependable after age 17 (21, p. 57).

Fewer data are available on the more recent Kuder test, but one study of
adults tested 15 months apart showed correlations as high as .93 (Musical) 
and as low as .61 (Social Service); median, .83 (24). 
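
The two kinds of stability coefficient mentioned above differ only in what is paired. A scale coefficient pairs many persons' scores on one key at two testings; a profile coefficient pairs one person's scores on many keys at two testings. The sketch below illustrates both with the ordinary product-moment correlation. All numbers are invented for illustration and are not data from Strong or Kuder; the function and variable names are likewise hypothetical.

```python
from statistics import mean


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Scale stability: five hypothetical persons scored on ONE key
# (say, Lawyer) as seniors and again on retest.
lawyer_first = [52, 38, 45, 30, 41]
lawyer_retest = [49, 40, 47, 28, 36]
scale_stability = pearson_r(lawyer_first, lawyer_retest)

# Profile stability: ONE hypothetical person's scores on six keys
# at the two testings.
profile_first = [55, 31, 42, 48, 27, 39]
profile_retest = [51, 35, 40, 50, 30, 33]
profile_stability = pearson_r(profile_first, profile_retest)

print(round(scale_stability, 2), round(profile_stability, 2))
```

A high scale coefficient does not guarantee that every individual profile is stable, which is why the text reports the range of profile correlations as well as the coefficients for single keys.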

Validity for Vocational Guidance 

The evidence that interest test scores do discriminate between men in
various occupations is partial evidence that interests are a sound basis for
guidance. Both the Strong and the Kuder tests have been studied sufficiently
to verify that the majority of successful men in an occupation have
corresponding scores on the interest tests. The validity of a scale, however,
depends upon its interpretation. Not all engineers have the same interest.
Estes and Horn made new keys for the Strong test which distinguished
between advanced students in civil, mechanical, electrical, chemical, and
industrial engineering. The correlations between scores on these special
keys and Strong’s Engineer key included .71 (electrical), -.09 (civil), and
-.63 (industrial) (11). Merely identifying the student as belonging in
a broad vocational-interest type does not guarantee that he will be equally
satisfied by all jobs within that area. At the same time, engineering jobs are
enough alike so that a mechanical engineer would presumably like civil
engineering better than journalism.

Table 64. Representative Interest Tests

Test and Publisher: Brainard Occupational Preference Inventory (Psych. Corp.)
Age Level: Adolescents and adults
Scores Reported, Special Features: Seven fields: commercial, personal service,
    professional, etc. These further subdivided into areas (e.g., dairying)
Comments of Reviewers*: “In the hands of qualified counselors, can be useful.
    . . . Too much has been sacrificed to convenience. . . . Entirely too
    brief” (Manuel, 2).

Test and Publisher: Dunlap Academic Preference Blank (World Book)
Age Level: Junior high school
Scores Reported, Special Features: Literature, history, etc. Predicts school
    achievement
Comments of Reviewers*: “Uses not adequately investigated. . . . Probably not
    usable for describing individual interests. . . . Constructed with unusual
    care” (Cronbach, 2).

Test and Publisher: Interest Index 8.2a, b, e (Prog. Educ. Assn.)
Age Level: Adolescents
Scores Reported, Special Features: Liking for sports, fantasy, etc. Clinical
    analysis of personality is made (cf. pp. 324 ff.)
Comments of Reviewers*: (none given)

Test and Publisher: Kuder Preference Record (Sci. Res. Assoc.)
Age Level: High school, college
Scores Reported, Special Features: Computational, literary, etc.
Comments of Reviewers*: “Quick and economical point of departure in helping
    students select occupations” (Berdie, 2).

Test and Publisher: Lee-Thorpe Occupational Interest Inventory (Calif. Test
    Bureau)
Age Level: Adolescents
Scores Reported, Special Features: Artistic, business, occupational level, etc.
Comments of Reviewers*: “Acceptable largely because of the possibilities of
    using test items as leads in interviewing. . . . Scores are of limited
    value due to lack of research” (16).

Test and Publisher: Pressey Interest-Attitude Tests (Psych. Corp.)
Age Level: Grade 6 to adult
Scores Reported, Special Features: Emotional maturity, measured by empirical key
Comments of Reviewers*: “Promising. . . . Valuable clues for investigation
    through interviews” (Spencer, 1, p. 86).

Test and Publisher: Strong Vocational Interest Blank (Stanford)
Age Level: Adolescent and adult men; women
Scores Reported, Special Features: Empirical keys for interests similar to
    engineers, farmers, etc.
Comments of Reviewers*: “Undoubtedly of value” (4, p. 459). “No adequate
    information on validity [of women’s blank]” (Strang, 1, p. 462).

* The reader should refer to the source cited for a full statement of the
reviewer’s opinion.

Interest scores have been correlated with self-estimated interests. On
the Kuder categories, correlations between scores and self-estimates were in
part as follows: Scientific, .48; Musical, .58; Social Service, .39; Persuasive,
.62; average, .52 (6). This is perhaps most significant as evidence that
self-reports in response to varied items give a picture that differs appreciably
from responses to occupational labels.

The only way to really determine whether guidance based on interest
tests is sound is to follow persons given such guidance until they have
become settled in their work, and then to measure their job satisfaction. No
such studies have been made. The nearest approach to such evidence is
Strong’s follow-up of college graduates. Some of these men changed
occupation within 10 years after graduation. Strong shows evidence for the
following statements: Men who remain in an occupation for 10 years average higher
scores for that occupation than for any other; men continuing in an
occupation have higher scores in that interest than men who try the occupation
and change; men who change from one occupation to another change to
one in which their interest scores are about as high as for the first choice.
Interest scores in college have a clear relation to vocation chosen (21).

Carter reviews a number of scattered studies with imperfect criteria and 
concludes (4, p. 32) : “In summary, one may say that there are very few care- 
ful studies which furnish clear-cut evidence of the validity of interest in- 
ventories for the prediction of vocational choice, but that this is due in part 
to the complexity of the problem. There are a variety of reports indicating 
that the majority of practical workers have confidence in the tests. The 
available heterogeneous array of evidence tends to show that the inventories 
are sufficiently valid for use, but the nature of vocational choice is such that 
no single quantitative test of validity is at present feasible.” 

Interest tests in no way represent ability or aptitude. They deal with only
one of many factors leading to success in a job.

18. Do you agree with the following statement? 

“Insofar as stated choice of occupation by groups of individuals (high 
school girls) may be considered a true criterion of interest, the lack of re- 
lationship between statement of occupational choice and interest scores . . . 
may be considered evidence of the lack of validity of the interest inventories” 
(15, p. 89). 

Validity for Academic Prediction 

Although the Strong and Kuder keys have low correlations with grades,
they measure a factor of undeniable significance in school success. Several
studies have shown that they improve the prediction obtainable from
intelligence measures alone. Segel found definite correspondence between
interests and differences in achievement between different courses. The
correlation of Strong Engineer interests and mathematics-marks-minus-history-
marks was .61 (20). Correlations between interest scores and grades in
engineering and dentistry are low, but in one study of dental students 92
percent of those with A and B+ ratings were graduated, compared with
67 percent of B’s and 25 percent of C’s (21, p. 524).

Greater success in predicting grades has been obtained with specially
constructed keys. There is a “studiousness” key for the Strong blank, based on
items which distinguished good and poor achievers (27). Mosier’s findings
illustrate results with this scale. Studiousness scores and grades correlated
.47 for students in liberal arts, .24 for engineering students, and only .05 for
business administration majors (17). Evidently such a score can be used to
improve prediction, but its validity must be tested in each new situation.
A similar technique has been applied to the Kuder test (9). Dunlap has
devised a test for predicting junior-high-school grades by selecting interest
items which correlate with achievement in specific subjects. Separate keys
predict grades in literature, reading, social studies, and so forth. Correlations
range from .50 to .70 (10).
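
The general idea of an empirical key of this kind can be sketched as follows. This is a minimal illustration, not the procedure used for the studiousness key or by Dunlap: the responses are invented, the cutoff of .25 for the difference in endorsement rates is an arbitrary assumption, and every keyed item is given a weight of one.

```python
def build_empirical_key(high_group, low_group, min_difference=0.25):
    """Return indices of items endorsed more often by the high group.

    Each group is a list of response records; each record is a list of
    0/1 marks (1 = item endorsed). min_difference is a hypothetical
    cutoff on the difference in endorsement proportions.
    """
    n_items = len(high_group[0])
    key = []
    for i in range(n_items):
        p_high = sum(record[i] for record in high_group) / len(high_group)
        p_low = sum(record[i] for record in low_group) / len(low_group)
        if p_high - p_low >= min_difference:
            key.append(i)
    return key


def score_with_key(record, key):
    """Unit-weight score: number of keyed items the person endorsed."""
    return sum(record[i] for i in key)


# Invented responses of four good and four poor achievers to four items.
good = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1], [1, 1, 0, 1]]
poor = [[0, 1, 0, 1], [0, 1, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0]]

key = build_empirical_key(good, poor)
print(key, score_with_key([1, 0, 1, 1], key))
```

A key built this way capitalizes on chance differences in the particular groups used, so its scores would have to be correlated with the criterion in a fresh sample before use, in line with the caution that validity must be tested in each new situation.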

Validity for Employee Selection 

Strong has made numerous comparisons of interest scores for insurance 
agents with their success. The correlations are about .40, when success is 
judged either by ratings or by business produced. Men with A scores in 
interest produce, on the average, $169,000 per year of new policies, whereas 
C men average only $62,000. A few C men, however, are unmistakably 
successful (21, pp. 486-500). A number of minor studies have found small
correlations between interest scores on the Strong test and criteria of success 
in other vocations. 

Some efforts have been made to design prediction keys based on items 
which discriminate between good and poor workers. One of the few studies 
to show promising results obtained significant differentiation between 
good and poor salesmen of accounting equipment and between good and 
poor servicemen for that equipment (19).

Davis speaks highly of the Activity Preference Blank used by the Adjutant 
General’s Office for personnel classification. While this test was thought to
be proof against false self-report, an information test of interests was per- 
haps even better. The Air Force General Information Test (see Table 49) 
measured aviation interest by testing the candidate’s knowledge about fly- 
ing. Such tests, used for predicting success in pilot training, says Davis, 
“were so satisfactory as to imply that this approach to the measurement of 
vocational interests should be more widely used than it has been.” The cor- 
relation of this test with graduation from training (.51) was higher than for
any other single test used in the entire AAF selection program (8, pp. 54-56).

19. What might account for the fact that some men succeed as insurance salesmen 
whose interests are very different from insurance salesmen in general? 

20. How would Strong’s correlation be expected to change if data were gathered
on an unselected sample of men, rather than those who were actually engaged
in insurance selling at the time of the study?

GUIDANCE USES OF INTEREST TESTS 
Interviewing 

Few counselors today are satisfied to make arbitrary predictions on the
basis of interest tests. Matching a young person to a vocational goal is a
highly individualized process in which temperament, previous experiences,
and the entire picture of interests must be considered. The counselor ordi- 
narily uses one or more interviews to formulate vocational plans. In this 
interviewing, interest tests have proved valuable. An interest test is, in fact, 
an interview, presenting the student with hundreds of questions such as the 
counselor might otherwise ask. The pencil-and-paper questioning conserves 
the time of the counselor for less routine work. 

Having before him the interest test, the counselor can focus the interview 
on areas where clarification is needed. Discrepancies between reported
interests and vocational aims are frequently observed; many men seeking
guidance indicate engineering occupations as a first choice, while showing
on the test a great dislike for computational activities. Questioning the client
as to the sort of engineering he envisions will help to bring to his attention
the unrealistic nature of his plans. The interest test also assists in
overcoming the inertia of the client who “has no idea what he’d enjoy.” After
expressing in the test his interest in a variety of specific activities, he has a
basis for considering possible vocations calling for those activities. 

Use for Academic Guidance 

Interest tests serve effectively in guidance programs in schools and
colleges. Where these programs are understaffed, so that only a small amount of
time may be given to each individual, the interest test reduces the
importance of intensive interviewing in the majority of cases. Every student may
be helped to select a course suited to his interests on the basis of a group
interest test. Attention is drawn to cases in which the interest scores do not
correspond to the student’s professed aims, so that more careful counseling
may be used to clarify the ambiguity.

An example of this use of interest tests is a freshman enrollment plan used
at a Western college. Every freshman takes the Kuder test, together with
academic classification tests, during the first day of registration. The test
scores, together with the student’s transcript and answers to a questionnaire
about his plans, are given to the faculty member who will plan the student’s
program. When the student and adviser confer, the interests shown on the
test are compared with expressed vocational plans, and discrepancies are
clarified or noted for later attention. Courses are then selected to permit the
student an opportunity to explore and develop his major interests, while
meeting general freshman requirements. This plan avoids the difficulty that
freshmen, asked to state their field of interest in college, frequently give an
inadequately considered answer and begin concentration in a major field
that later proves not to satisfy their real interests.

Interest tests are often useful in a guidance program as a means of open- 
ing counseling. Where counselors are available to give general or specialized 
assistance, it is sometimes difficult to reach those who should be helped. 
Some students are unready to unveil their problem to an unknown counselor, 
and others, while wishing to obtain help, feel their problem to be too minor 
to warrant seeking out a counselor. By offering interest tests to those who
wish them, it is possible to attract to the counseling office many who would 
not otherwise come but who need assistance. In a sense, it seems naive that 
anyone should wish to take a test “to find out what his interests are,” but 
appetite for such tests is widespread, and those who take the tests welcome 
an interpretation of their scores. Because they come voluntarily, many of 
the usual problems of rapport in counseling are avoided. Interest tests have
the advantage over other psychological tests that they have a high degree of
face validity, and can be interpreted without oversimplification or distortion.
The results of interest tests rarely threaten the ego, whereas in many cases 
it would be damaging to offer to interpret intelligence tests or adjustment in- 
ventories to the client without careful preparation in advance. The counselor 
will of course find many openings in the discussion of the interest test to 
probe areas where more complete counseling is appropriate, and may 
wish to introduce other tests later. The interest inventory is largely
self-explanatory, so that the counselor can lead the student to reach his own
decisions on pertinent questions. This conforms to the current belief that 
decisions the person makes for himself are likely to be effective and more 
lasting than those imposed by another’s authority. 

Interests and Personality 

In adjustment inventories the subject frequently conceals his attitudes and 
feelings. The person is usually pleased with and proud of his interests,
however. Especially where it is understood that the tests will be used to provide
the person with activities which will interest him, there is likelihood of
honest self-report. Interests give clues regarding adjustment and personality.

Interests may be scored for masculinity-femininity by noting if one enjoys
the activities normal to his sex group. The woman with strongly masculine
interests, or the man with strongly feminine interests, may have found
difficulty in adjusting to normal roles. Mollie Adams likes masculine clothes,
affects a terse style of speech, and looks to activities like bowling and
sporting events for recreation. Behind this lies a childhood in which her
pleasantest moments were spent with her father, who supervised her development
and encouraged her to be as rough and ready as the boys of her
neighborhood. Being somewhat insecure in the “ladylike” atmosphere that surrounded
the adolescent girls’ first parties, Mollie accentuated her interest in the less
gracious activities where she felt superior and at ease. Other girls may take
on masculine attitudes because of a desire to prove that they are not
mentally inferior, because of hidden aggression against men which leads to a
desire to defeat them, or because of feared inability to compete for the
attentions of men. Analogous motivations frequently account for excessively
feminine interests among men.

Some interests indicate withdrawing tendencies, and may suggest shyness 
and feelings of social inadequacy. Isolative hobbies, though meritorious and 
often helpful to adjustment, may be of diagnostic significance. Highly intel- 
lectual interests or concentration in some field where one has developed 
unusual competence may be an attempt to withdraw from fields where one 
cannot be sure of superiority. It can be said that extreme concentration on
any interest is symptomatic of need to overdevelop that aspect of satisfaction.

Some persons with conflicts arising from self-criticism find unusual satis- 
faction in activities which others find monotonous. Mathematical and clerical 
work, for example, appeals to some workers who need to be sure that what 
they do is right. Having added a column of figures and checked themselves, 
they can feel an assurance they could never have after writing a story, or 
planning a party, or some other activity where “rightness” is less objective. 
There are others who can be satisfied only when imposing their own indi- 
viduality upon their work, perhaps out of some fear of losing status if they 
are like others. Such people frequently dislike routine or stereotyped activi- 
ties, but respond eagerly to artistic tasks where originality is essential. 

This use of interest tests is an attempt to go behind the interests to
determine what type of personality would be conducive to such interests. Because
it relies on the insight of the interpreter, it is open to error.
CHAPTER 16


The Use of Test Results in Counseling 


At this point in our consideration of tests, we have examined all the types 
generally used for educational and vocational guidance. Tests to be consid- 
ered in subsequent chapters are most widely used in research and in clinical 
diagnosis, although in time they may be more used for counseling. This is 
therefore a convenient point to examine the problems that arise in the
application of tests in counseling.

RESPONSIBILITY OF THE COUNSELOR 

Special problems arise when tests are used in case work which are not 
important in military screening, employee selection, or research. In most psy- 
chological activities the main responsibility of the tester is to learn the truth. 
The test of a counselor, on the other hand, is whether he promotes the
welfare of the client. Even if a counselor uses perfectly truthful techniques, his
net effect is harmful if his truth leaves the client permanently less happy. In
using tests for counseling, the psychologist must be aware of their special 
limitations in case work. 

Unreliability in Individual Prediction 

Even an unreliable test gives results better than chance — in a group. But in 
the individual case, the investigator has only one chance to predict, and one
error may influence the life of the client drastically. Suppose it has been 
found that an IQ of 115 is needed for probable success in a profession. Even 
if 70 people out of 100 having IQ 110 fail in this field, one cannot make a 
clear prediction for Walter, IQ 110. Perhaps he would do better if tested 
again. Perhaps other qualities unknown to us make Walter one of the 30 who 
would succeed, rather than of the 70 who fail. Rarely are our tests so valid
that a prediction about a single case is certainly true.

The counselor who is conscious of unreliability adopts many precautions to
reduce its ill effects. He checks each test result against the case history for
consistency. In case of doubt, he confirms any significant test findings by a
second comparable test. He examines his case for special factors, such as
language difficulty, which might make the test invalid. Most important, he
thinks of a test performance as placing the subject in a probable range of
scores, rather than as pegging him firmly at a particular percentile. Tests
rarely misfire in stating that a child is “somewhat, but not extremely, below
average in scholastic aptitude.” The statement that “this child’s IQ is 87” is
almost certainly untrue, in the sense that further data would not precisely
confirm it.
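
The idea of a "probable range of scores" can be made concrete with the standard
error of measurement, SEM = SD × sqrt(1 − reliability). The sketch below is only
an illustration: the function name, the standard deviation of 15, and the
reliability of .90 are assumptions chosen for the example, not figures given in
the text.

```python
import math

def score_band(observed, sd, reliability, z=1.0):
    """Return (low, high) limits around an observed score.

    Uses the standard error of measurement:
        SEM = sd * sqrt(1 - reliability)
    z = 1.0 gives a band of one SEM on either side of the score.
    """
    sem = sd * math.sqrt(1.0 - reliability)
    return observed - z * sem, observed + z * sem

# Illustrative values only (assumptions, not taken from the text):
# an IQ scale with standard deviation 15 and a test reliability of .90.
low, high = score_band(observed=87, sd=15, reliability=0.90)
print(f"IQ 87 is better reported as roughly {low:.0f} to {high:.0f}")
```

Under these assumed values the band is about five points on either side of the
obtained score, which is why a statement of range is safer than the single
figure 87.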


Misuse of Test Information by Laymen 
Whereas the psychologist can qualify scores as demanded by their
invalidity and unreliability, few laymen have this training. They frequently
overrate the power of psychological tools and misinterpret test reports. Both
clients and such professional workers as teachers, physicians, and social 
workers may place false reliance on test data. Even when the tester’s report 
is carefully qualified, the person receiving the report is likely to remember 
only portions of it. A parent, learning that a child’s IQ is 87, may forget the 
tester’s cautions about what the test does and does not measure, the possi- 
bility of growth or decline in IQ, and the approximate nature of predictions 
from it. Instead, the figure itself may be carried sharply in mind for years and 
used as a basis for rejecting the child or for making significant decisions. 

Gross fallacies are often present in lay thinking about tests. There is the 
ever present belief that the IQ measures native intelligence. There is the 
misquotation that any child of IQ 140 is a genius. Percentiles are misread 
as percents. Norms (what the average sixth-grader does do) are misinterpreted
as standards (what every sixth-grader should do). Most psychologists
now make it a practice never to give any client or co-worker a test score
unless he can interpret that score so that the person will understand it.

Information as a Source of Maladjustment 
Test results have emotional significance for the individual. While the
scientist, on his side of the test paper, is being impartial and judicious, the
client, on his side, is reacting with fear, pride, and worry. To the counselor, a
score at the 75th percentile may look “good.” To the college student who has
viewed himself all his life as an exceptional person, at the top of the superior
group, this same finding may be shattering. The psychologist may feel he is
doing a service by telling a mother that her child cannot hope to become a
doctor like his father. To the parents, his statement can be a threat of loss of
face for having a not-superior child, an accusation of somehow having failed
as parents, and a deathblow to plans they have been living out for 10 years.
The counselor may be correct in attempting to give clients realistic ideas, but
when he unleashes emotional forces, he must accept the responsibility of
helping to resolve the conflicts aroused.

The counselor who reasons that the truth can do no harm neglects to consider
that the client often distorts the truth to suit his needs. Selective memory
is commonly noted. Given a large body of facts, the client forgets many,
but recalls those which have emotional significance for him. Sometimes he
suppresses and forgets those which are unwelcome, whereas in other cases he
forgets favorable comments while remaining concerned about some minor
defect in his record. Rationalization is also common. Given a test score which
is not what he would like to hear about himself, the subject may invent and
believe excuses for the score rather than treating it as a fact.

Taking a test is itself an emotional experience. This is shown by the evidence
of stress found in Binet and Rorschach protocols, and by experimental
measures of emotion during tests (8). Most people seek counseling because
they are uncertain about their ability or adjustment. If they take a test, they
expect it to give an answer to these doubts, for better or worse. A person who
fears that he lacks mental ability cannot be expected to face possible proof of
that fact calmly. Similar tensions are present when a test indicates fitness for
a vocation or degree of personality adjustment. In taking most tests, there are 
moments of failure. These arouse emotion. Many subjects who can do well 
and whose scores are high relative to the group are upset and discouraged at 
the end of a test because they are conscious of their errors and failures. This 
is a necessary consequence of tests designed so that the average person will 
miss or fail to complete many items. 

Unfortunately, problems are not eliminated by reassurance. The tester who 
tells the subject that he has done well does not remove doubts from his mind. 
If a subject has a thoroughly integrated feeling of inferiority, change in his 
attitudes must come through his readjustment, rather than through the assur- 
ance that an outside authority thinks him good. Because feelings of insecurity 
have usually been present for years, perhaps stemming from a felt lack of 
acceptance in infancy, they are not to be removed by a single “fact.” 

The problems stated in this section define the responsibilities of the
counselor. There is no clear set of rules for avoiding the difficulties suggested. The
counselor who is aware of the pitfalls may use tests more wisely in a
particular situation.

1. Discuss the advantages and disadvantages of telling parents the intelligence test
scores of their first-grade children.

2. Discuss the advantages and disadvantages of telling college freshmen their 
standings on the American Council Psychological Examination. 

3. Discuss the advantages and disadvantages of giving the Bernreuter Personality 
Inventory to a college psychology class and permitting each person to score his 
own paper. 

4. In each of the above cases, if it were decided to give out the information, what 
precautions would be helpful? 

NON-DIRECTIVE OR CLIENT-CENTERED COUNSELING 
The Non-Directive Point of View 

In earlier days of psychological service, the counselor was often viewed as 
an expert passing judgment, as an engineer would inspect a bridge or a 
physician prescribe for a disease. More recently attention has been drawn to 
a counseling relation in which the counselor does not decide, direct, or 
advise, but rather helps the client think for himself. In extremely directive
or prescriptive counseling, the “expert” obtains facts, decides, and tells the
client what to do. More client-centered counseling stresses the importance of
the client’s making his own decisions and accepting himself and his
characteristics. The client-centered point of view proposed by Rogers (5) has
been a controversial topic, but most counselors have found his suggestions 
acceptable and desirable at least in part. 

Rogers’ basic view is that the important goal is the growth of the client 
toward maturity and adjustment. A person who has learned to face his problems
and rely on his own judgment has been helped more than one who must
seek advice in each new crisis.

Expert advice often fails to remove problems because psychological prob- 
lems are entangled in emotional attitudes. The true problem is often not the 
surface problem voiced to the counselor. Suppose Stan Howard, an employee 
on a finishing machine, comes to inquire why he was not promoted to fore- 
man. The directive personnel worker might give him the facts, based on tests 
and ratings, which “prove” that Howard would be a poor foreman. He may
even give Howard a pep talk on how well he produces, about the chance of
raising his pay as a workman, and about the undesirability of seeking a job 
where he would fail. Howard is likely to nod his head and leave. But he may 
be far from convinced he should not be a foreman. He may now, in fact, quit 
and go to another company where he’ll “have a chance.” He may have failed 
to state, or even recognize, that he is anxious to be a foreman solely because 
his brother-in-law is a foreman and he wishes equal status. Similar “irrele- 
vant, non-objective” factors often lie behind the student who studies inade- 
quately, the aircrewman who strives to be a pilot, the mother who overrates 
her child’s ability, or the unpopular girl. Even when he seeks counseling, the 
client phrases his problem to protect the tender spots of his ego. The counselor
who “solves” a surface problem may be helping the client to avoid facing
his real conflicts.

The methods suggested by Rogers help the client face his true feelings.
The counselor listens, accepts, and reflects the client’s feelings by rephrasing
what the client has said: “You think you’d rather be a foreman than a machine
operator”; “It’s discouraging when a man who came after you did is
promoted over you”; “You feel that the management doesn’t trust you.” By
accepting the feelings instead of trying to prove them false, Rogers finds
himself able to promote ultimate adjustment. The client, freed from need to
justify and apologize for his attitudes, gains insight into himself. Decisions
made at this level stick because the client believes them.

The client is made responsible for the treatment. He asks the questions,
limits the area discussed, makes the judgments, and decides when to terminate
the counseling. If the counselor proposes a test, or suggests that poor
reading may be a source of difficulty, or lays down alternative solutions, he is 
taking responsibility. The rejection of these methods is Rogers’ most novel 
and most crucial point. 
5. To what extent are these statements non-directive (insofar as can be judged
without knowing the client’s preceding remark)?

a. Tell me about your background so I can understand this better.

b. Here is a list of books from which you may wish to read something on your
problem. 

c. You would like to get the opinion of another person. It seems too important 
to risk a wrong decision. 

d. Fortunately your previous record shows that you can do this well. 

e. I’ll be free tomorrow at 10 if you wish to come in again. 

Tests in Non-Directive Counseling 

To a large extent tests, which are designed to help the tester be wise, 
become of secondary importance because they do not center on the feelings 
of the client. Says Rogers (6, p. 140):

“The counseling process is furthered if the counselor drops all effort to evaluate 
and diagnose and concentrates solely on creating the psychological setting in which 
the client feels he is deeply understood and free to be himself. It is unimportant 
that the counselor know about the client. It is highly important that the client be 
able to learn himself. (Not to learn about himself, but to learn and accept his own 
self.) In making use of these principles the counselor examines his own attitudes 
and techniques and endeavors to refine his procedures so as to eliminate all which 
are not in accord with the basic principles. Thus questions are eliminated from the
interview because they invariably direct the conversation, advice is eliminated
because it assumes the counselor to be the responsible person, diagnosis and
evaluation are put aside because it has been learned that even when they are not voiced
they tend to distort the counselor’s responses in subtle ways and to break down his 
full acceptance of client attitudes.” 

Tests are not abandoned by this outlook. Instead of becoming ways of
diagnosing the client, learning the answers so we can tell him, they become
ways of helping him find out about himself. In non-directive counseling, tests
enter only when the client asks for them. The client who comes in with the
statement, “I’m worried because it takes me so long to learn an assignment,”
is not immediately seated before a battery of tests. Instead, counseling may
go through to completion with no use of tests. Perhaps, in the course of
examining his difficulties, he says, “I’ve often worried about whether I’m as
bright as these students I compete with. I thought you people had some
tests that would tell about that.” Then the counselor can supply him with
the means of measuring himself, since he has apparently reached the
maturity required to face his question honestly.

It is the repeated experience of counselors trying this technique that when
tests are delayed, problems come to light which would otherwise never have
been voiced. A student may come to request an intelligence test. If given the
test, told that his score is normal, and dismissed, he has been reassured but 
not necessarily helped. In taking the test he may have reduced his tension 
temporarily, but may have failed to solve the basic conflict that set him 
wondering about intelligence. Perhaps he was worrying about changing his
major, or perhaps he is concerned because his grades in college are lower 
than in high school. Or the problem may be as remote from that stated as a 
worry because his wife’s family considers his pronunciation peculiar. Insofar 
as the counselor avoids methods which bring the conference to a head, give 
an answer, and terminate it, he permits the client to dig into what really con- 
cerns him. 

Most counselors compromise with the non-directive approach to some de- 
gree because of administrative conditions or other reasons. Even where the 
approach is not purely non-directive, emphasizing the client’s responsibility 
is a helpful technique. Thus Bordin and Bixler place on the client the re- 
sponsibility of choosing the tests to be taken. In contrast to establishing a 
standard battery of tests to be taken, they answer questions about tests by 
discussing at length the sorts of tests available. They neither recommend a
particular test nor limit their descriptions to the tests the client asks about.
After hearing what tests can be had, the client takes the initiative in deciding
what to find out about himself (2).

Interpreting Tests Non-Directively 

Decisions made by the counselor are apparently less effective for most
clients than those they make themselves. The counselor who knows what test
results imply can help the client most if he helps him learn their meaning 
and reason out his own decision, rather than following the medical practice 
of writing a prescription but concealing the diagnosis. Bixler and Bixler have 
made numerous suggestions for applying the client-centered point of view to 
test interpretation (1).

The counselor avoids the giving of opinions. The counselor is always
tempted to comment on the goodness or badness of scores, if only as a means 
of giving confidence or of emphasizing the seriousness of symptoms. Such 
evaluation comes between the client and the score and makes it harder for 
him to accept the score as a reality. Bixler urges a statistical prediction, in 
such form as “Eighty out of one hundred students with scores like yours
succeed in business.” This is easier for the client to interpret than merely,
“You are at the sixty-fourth percentile.” The writer has found it an even 
better device to show college freshmen a scatter diagram plotting the test 
scores and grade averages of hundreds of cases. The client can judge the 
“success” of previous students by looking at the column of the chart where 
scores equal to his are plotted. The graph may show some cases making A’s 
and some making failures. He can see for himself whether with his score he 
is likely to attain his aspiration. This procedure also reinforces the concept
of unreliability in prediction. The scatter of cases makes it clear that his
future is not certain.
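
A statement of the form “eighty out of one hundred students with scores like
yours succeed” is simply a count taken from one column of such a scatter
diagram. The sketch below shows the arithmetic under stated assumptions: the
records are invented for illustration, and both the definition of “scores like
yours” (within five points) and the name `expectancy_per_hundred` are
hypothetical choices, not a procedure described by Bixler.

```python
def expectancy_per_hundred(records, score, half_width=5):
    """Among past cases scoring within half_width of `score`, return how
    many per hundred succeeded, together with the number of cases found."""
    nearby = [ok for s, ok in records if abs(s - score) <= half_width]
    if not nearby:
        return None, 0
    return round(100 * sum(nearby) / len(nearby)), len(nearby)

# Hypothetical records: (test score, True if the student later succeeded).
records = [
    (52, False), (55, True), (57, False), (58, True), (60, True),
    (61, True), (63, False), (64, True), (66, True), (70, True),
]
rate, n = expectancy_per_hundred(records, score=60)
print(f"About {rate} out of 100 students with scores like yours succeeded "
      f"(based on {n} past cases).")
```

Reporting the number of cases alongside the rate keeps the client aware of how
thin the evidence in his column of the chart may be.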

Bixler’s second suggestion is that the counselor should be frank, although
using general terms rather than specific scores. Low scores must be faced
honestly, if the client is to gain in self-knowledge. But this can only come 
when he expresses readiness for the knowledge. 

Personality tests are used primarily as a basis for helping the individual 
view his own feelings. Instead of, “Your home adjustment is low,” Bixler says,
“You seem to feel that you have more difficulties at home than in your social
living.” This invites the client to respond with more expression and recognition
of feeling, rather than taking the score as a terminal diagnosis. It is
especially necessary not to label a client maladjusted or neurotic or introvert.
Such a label is threatening and carries no implications for constructive de- 
cisions. 

The client must always feel free to reject any interpretation. He must be
able to say that, though his score is low, he expects to succeed. He must
be able to reject his own interest test score by insisting that he really likes
engineering despite a low computational interest. It is only when he learns
that he need not argue with the counselor that he feels free to examine
himself non-defensively.

The counselor must be aware of and help the client recognize his emotional
reactions to the test scores. It is emotional reactions that block rational
thinking, so that the client can use the scores wisely only after he has come
to an understanding of his emotions.

These points are illustrated in the following dialog published by Bixler 
from a case record (1, pp. 151-152):

“Counselor — Sixty out of one hundred students with scores like yours succeed in 
engineering. About eighty out of one hundred succeed in the social sciences. . . . 
The difference is due to the fact that study shows the college aptitude test to be 
important in social sciences, along with high school work, instead of mathematics. 

“Student — But I want to go into engineering. I think I’d be happier there. Isn’t 
that important too? 

“C. You are disappointed with the way the test came out, but you wonder if your 
liking engineering better isn’t pretty important? 

“S. Yes, but the tests say I would do better in sociology or something like that. 
(Disgusted) 

“C. That disappoints you, because it’s the sort of thing you don’t like.

"S. Yes. I took an interest test, didn’t I? What about it? 

“C. You wonder if it doesn’t agree with the way you feel. The test shows that 
most people with your interests enjoy engineering and are not likely to enjoy social
sciences — 

“S. (Interrupts) But the chances are against me in engineering, aren’t they? 

“C. It seems pretty hopeless to be interested in engineering under these
conditions, and yet you’re not quite sure.

"S. No, that’s right. I wonder if I might not do better in the thing I like — Maybe 
my chances are best in engineering anyway. I’ve been told how tough college is, 
and I’ve been afraid of it. The tests are encouraging. There isn’t much difference 
after all — Being scared makes me overdo the difference.” 

6. What further insights might the client gain from continuing the conversation? 

What would be a suitable client-centered remark for the counselor to make next? 
7. Which of the following statements would it have been legitimate for a
non-directive counselor to make?

a. It’s probably better for you to work in an area you like than to follow these 
tests strictly. 

b. Studies show most people develop an interest in areas they do well; you 
probably would learn to like social science if you tried it. 

c. If you stay in engineering, you should plan to take a course in remedial 
mathematics. 

d. It seems to you that it’s wisest to work in the field where your chances are 
best. 

8. Reread the counselor’s remarks carefully. Did he at any time suggest what he 
thought was right, or what he approved of? Did he disapprove of any idea of 
the client? 


PRESCRIPTIVE COUNSELING 
The Directive Point of View 

Although emphasis has been placed on permissive counseling above, it 
should not be assumed that prescriptive methods are obsolete. They are
widely used under many circumstances. Some counselors prefer them.
Administrative requirements often force a counselor to be the responsible
person, as when a veterans’ counselor is required by law to approve the
vocational plans of certain men. When a case is referred for counseling, rather
than coming in voluntarily, the counselor cannot apply client-centered
methods fully. Cases in which the client is incapable of self-direction, as in
psychosis and severe neurosis, must also be prescribed for.

Those using tests directively emphasize the importance of “objective facts” 
as a basis for rational decision, in contrast to Rogers’ emphasis on the
emotional meaning of the “facts.” The directive counselor often thinks of the
client as leaning on someone for support, and considers tests an especially 
sound basis for giving the direction sought; in other cases, the counseling is 
viewed as a problem of convincing the client that his plans should be 
changed, and tests are regarded as a forceful type of evidence (7). The 
counselor who wishes to bring his client to face the facts takes a stand similar 
to John Dewey’s (paraphrased here from a passage dealing with children) 
(3, pp. 84-85):

The suggestion upon which clients act must in any case come from somewhere. 
It is impossible to understand why a suggestion from one who has a larger experi- 
ence and a wider horizon should not be at least as valid as a suggestion arising from 
some more or less accidental source. It is possible of course to abuse the office, and
to force the activity of the young into channels which express the counselor’s pur- 
pose rather than that of the client. But the way to avoid this is not for the counselor 
to withdraw entirely. . . . The counselor’s suggestion is not a mold for a cast-iron 
result but is a starting point to be developed into a plan through contributions from 
the experience of all engaged in the counseling process. 

Prescriptive counselors generally organize their work to obtain as wide a
variety of information about the client as possible, make a wise interpretation,
and bring the client to base his action on this information. While they
respect the right of the client to choose between alternatives of merit and do
not force even a wise course of action upon him, their emphasis is on keeping
the client from making errors. Says Williamson, who represents a “human
engineering” point of view (9, pp. 134-138):

“The effective counselor is one who induces the student to want to utilize his
assets in ways which will yield success and satisfaction. . . . Ordinarily the
counselor states his point of view with definiteness, attempting through exposition to
enlighten the student. . . . In respect to no student’s problem does the counselor
appear indecisive to the extent of permitting loss of confidence in the authority of
his information. . . . If it is true that the counselor should not make the student’s
decision, it is equally true that someone must render this very service until some
students are able, intellectually and emotionally, to think for themselves.”

Test Batteries for Preliminary Diagnosis 

Some testers have a battery of tests which serves as their first method of
getting acquainted with the client. For Veterans Administration counseling
centers a common battery consists of a short ability test, such as Otis Gamma,
an achievement measure for those considering further schooling, the Kuder
or Strong interest tests, and the Bell Adjustment Inventory (Adult). For
those who are concerned with mechanical occupations, the Bennett test and
various pencil-paper and dexterity tests may be used. Upon completion of
the first tests and perhaps a preliminary interview, further tests may be called
for, such as a reading measure, a Wechsler test, or aptitude tests. Final
decision is based on a case history and a consideration with the client of his
record. The program aims to provide the veteran with advice from a trained
person who knows the wide range of occupations, the possibilities of further
schooling and of psychological assistance, and the ways of diagnosing
abilities and problems. The counselor is a skilled teacher, who must help the
client acquire needed information as well as emotional attitudes.

In contrast to the standard-battery approach to diagnosis, many counselors 
feel that the best use of tests is to make a careful preliminary survey of the 
problem, before choosing particular tests to answer questions specific to the 
case. Thus one mental test might be chosen for one case, and not used again 
in months of counseling. Doll makes this comment regarding clinical diagno- 
sis (4, p. 168): 

“Every case study originates in some social situation. This is commonly referred 
to as the immediate problem or the ‘complaint.’ Sound casework requires that this 
immediate problem be formulated as explicitly as possible, and that it be verified
in fact as may be practicable by objective procedures before proceeding with the 
clinical analysis of the problem. The delineation and verification of the initial 
problem requires a review of the individual’s history in all aspects and this review 
is most serviceable when obtained prior to the use of test procedures. One of the 
most serious errors in current clinical-psychological practice is to proceed with test
devices long before the immediate problem is specifically assured and before the
antecedent history of the individual has been adequately explored. . . . The only
sound position that can be taken by any competent clinician is to hold fire on test
procedures until these preconditions of a satisfactory mental diagnosis are well
established. Only by this approach can the clinician decide which tests are relevant
in a given case and how many or of what variety. Only under these circumstances 
can he economically and quickly determine the direction of test analysis. And 
only from this base can he properly interpret the data derived from applying 
clinical-psychological test procedures with assurance of a harmonious coordination 
of these data with the entire clinical picture.” 

9. In each of the following situations, discuss the advantages and disadvantages
of a standard battery given before the interview, compared to tests chosen
on the basis of preliminary study of the case.

a. A school psychologist works with children of junior-high-school age re- 
ferred by their teachers as problem cases. The psychologist is to recom- 
mend suitable ways of dealing with the child. 

b. A clinical psychologist is to examine all men reporting to a veteran’s 
counseling center who are suspected of serious emotional disturbances. 

c. A remedial-reading instructor in a college makes a study of all cases re- 
questing help to determine if a reading difficulty is present. 

d. A vocational adviser is employed by a small city school system to test all 
pupils and make information regarding their aptitudes and interests
available to the teachers who will do the counseling.

10. In the following situations, decide whether the welfare of the client would
be promoted by giving him the information that tests could provide.

a. High-school records of a college freshman suggest that he is probably so
lacking in scholastic aptitude that he is more likely to fail than not. 

b. An engineering student seeks advice because of his poor grades. A tele- 
phone call to his instructor discloses a probable deficiency in arithmetic. 

c. A worker protests his non-promotion to foreman. Test records on file show
that his intelligence is below that of successful foremen. 

CAUTIONS IN COUNSELING 

In helping the client make decisions, the counselor, whatever his
technique, wishes the client to have a basis for optimism regarding the future.
The non-directive counselor would prefer that this come through insight,
whereas the directive counselor tends to give direct encouragement. In
either case, however, the client should leave the counseling with a positive 
plan for action, rather than merely the knowledge that his former plan was 
inadequate. Similarly, he must have a feeling that he has some strong quali- 
ties, rather than a total feeling of failure because tests have brought to light 
only weaknesses. In every test performance, there are some praiseworthy 
aspects. The counselor who wishes to give support will call attention to such
features as accuracy, originality, or persistence, in addition to giving the 
client facts about his score. Nearly all counselors working with normal late
adolescents and adults agree in giving the client the facts on which
recommendations are based. The counselor who refuses to give scores,
even in general form, sets up a fear in the client that he was not told because
his scores were too poor.
The most helpful single principle in all use of testing is that test scores are
merely data on which to base further study. They must be coordinated with
background facts, and they must be verified by constant comparison with
other available data. This is the reason that continued counseling by an 
adviser over a year is more effective than “one shot” counseling, where an 
answer is given to each new specific problem by a different adviser. The 
test score helps the counselor by warning him to look in the record for fur- 
ther symptoms of a particular problem. The score, and study of items within 
the tests, suggest topics to probe by interview methods. While sometimes it is
necessary to act on a problem immediately, it is sound practice to defer a 
final decision as long as possible, meanwhile seeking confirmation of 
tentative diagnoses. 

Wherever possible, test results should be a starting point for action. They
should not be an end in themselves. There is some satisfaction in being
diagnosed and evaluated, but counseling is helpful only when behavior is
changed. Test results may be used to encourage analysis of study habits,
vocational exploration, or activities leading to personality development. Tests,
at their most effective, are ways of helping people live more happily.


11. Discuss the advisability of delaying final decision in each of these situations. 
What supplementary information should be sought to confirm the tenta- 
tive conclusions? 


a. A college student who is failing in engineering at midterm seeks a more
suitable vocational goal. Aptitude and interest tests suggest journalism.

b. An engaged couple, after a quarrel, seeks the help of a marital counselor. 
A personality test intended to predict marital adjustment (validity .50) 
shows that their score as a pair is low, in the range where there is an even 
chance of divorce. 


c. Students applying to enter a graduate school for social work are tested 
routinely. A girl shows severe neurotic signs on both self-report and the 
Rorschach test (Chap. 20). 


SUGGESTED READINGS 

Bixler, Ray H., and Bixler, Virginia H. “Test interpretation in vocational
counseling.” Educ. psychol. Measmt., 1946, 6, 145-155.

Bordin, Edward S., and Bixler, Ray H. “Test selection: a process of counseling.”
Educ. psychol. Measmt., 1946, 6, 361-373.

Combs, Arthur W. “Non-directive techniques and vocational counseling.”
Occupations, 1947, 25, 261-267.

Rogers, Carl R. “Psychometric tests and client-centered counseling.” Educ.
psychol. Measmt., 1946, 6, 139-144.

Williamson, E. G. “Counseling and the Minnesota point of view.” Educ. psychol.
Measmt., 1947, 7, 141-155.

REFERENCES 

1. Bixler, Ray H., and Bixler, Virginia H. “Test interpretation in vocational
counseling.” Educ. psychol. Measmt., 1946, 6, 145-155.





2. Bordin, Edward S., and Bixler, Ray H. “Test selection: a process of
counseling.” Educ. psychol. Measmt., 1946, 6, 361-373.

3. Dewey, John. Education and experience. New York: Macmillan, 1938. 

4. Doll, Edgar A. “The social basis of mental diagnosis.” J. appl. Psychol., 1940,
24, 160-169.

5. Rogers, Carl R. Counseling and psychotherapy. Boston: Houghton Mifflin, 
1942. 

6. Rogers, Carl R. “Psychometric tests and client-centered counseling.” Educ.
psychol. Measmt., 1946, 6, 139-144.

7. Staff, Advisement and Guidance Service, Veterans Administration. “The use
of tests in the Veterans Administration counseling program.” Educ. psychol.
Measmt., 1946, 6, 17-23.

8. Waite, William H. “The relationship between performances on examinations
and emotional responses.” J. exp. Educ., 1942, 11, 88-96.

9. Williamson, E. G. How to counsel students. New York: McGraw-Hill, 1939. 



CHAPTER 17 


Self-Report Techniques: Attitudes 


ATTITUDE SCALES AND OPINION SURVEYS 

Political and social beliefs have been an important subject for investigation
by psychologists and sociologists. Some studies have been primarily
concerned with surveying the opinions of the public in general or of special
groups. Opinion surveys such as the Gallup poll place little or no emphasis
on reliable measurement of individuals. In contrast, attitude scales attempt
to describe reliably the attitude of each individual, although they may also
be helpful for group surveys. Polls are most often conducted by interviewing,
whereas the attitude scale is a paper-pencil test. Where opinion surveys may
deal with miscellaneous questions, asking only one question on each topic,
attitude scales usually concentrate on one issue or group of issues. Like other
tests, the attitude scale combines many questions measuring the same factor
in order to obtain a reliable score. Opinion surveys try to report political and
business trends for the guidance of policy makers.

Attitude scales have generally been used for more intensive research. The
large number of questions makes the scale less satisfactory for widespread
and practical use. The refined measurement permits one to correlate
individual attitude scores with other factors and to detect small shifts in score.
Attitude scales are suited to testing the effect of education or propaganda
upon beliefs. The wide range of use of attitude scales is suggested by the
variety of attitudes measured with scales now in existence. These include
scales dealing with Russia, Negroes, communism, birth control,
conservative vs. untraditional educational practice, labor unions, and satisfaction
in one’s job.

1. Design an experiment to test whether newspaper editorials change the beliefs
of a community toward the United Nations. How would the procedures of
obtaining data be changed if an attitude scale were used for measurement
rather than a single-question opinion survey?

2. Discuss whether an opinion survey or an attitude scale would be more
suitable for investigating each of these problems.

a. A motion-picture producer wishes to know which of two titles will be most
appealing to the public.

b. A manufacturer wishes to know whether the majority of his men favor
a group medical insurance plan.

c. An investigator wishes to compare the attitude toward free enterprise of
students majoring in different college departments.

d. A social service administrator wishes to determine the attitudes of his staff
toward certain social groups as a basis for assigning them to duties in
various sections of a large city.

ATTITUDE-TESTING TECHNIQUES 
The Thurstone Method 

The majority of attitude scales represent either the Thurstone or the Likert
method of scale construction. The Thurstone method is important both
because of its wide use and because it represents a major advance in test
theory. Prior to the development of Thurstone’s technique in 1928, attitude
“measurement” had been largely confined to simple questionnaires. Among
the weaknesses of such procedures was the lack of evidence that the separate
questions measured the same attitude, and the arbitrary units of
measurement employed.

While one could obtain numerical scores by counting answers, there was
no assurance that the difference between scores 40 and 50, say, represented
the same difference in attitude as the difference between scores 70 and 80.
Thurstone’s method, derived from psychophysical techniques, sought to
arrange people on a continuous scale having equal-appearing units. Then, if
Charles had a score midway between Alfred and Bill, one could say that his
attitude was midway between the other two. Thurstone conceived of
attitudes toward any object as being arranged on a single continuum, from
highly favorable toward the object to highly unfavorable. An ideal scale
would indicate just how favorable or unfavorable each subject is.

A Thurstone attitude scale consists of 20 or more statements representing
all degrees of opinion. The subject indicates with which statements he
agrees. Each statement has been assigned a scale value, ranging from 0.0
for the most extreme statement possible in an unfavorable direction, through
5.5 for neutral statements, to 11.0 for the most extremely favorable statement
possible (33). Representative statements, with their scale values, from the
scale for measuring attitude toward communism are as follows:

Communism is the solution to our present economic problems (9.1). 

Give Russia another twenty years or so and you’ll see that communism can be 
made to work (8.4). 

Workers can hardly be blamed for advocating communism (7.0). 

Both the evils and benefits of communism are greatly exaggerated (5.4). 

I am not sure that communism solves the problems of capital and labor (4.7). 

If a man has the vision and the ability to acquire property, he ought to be al- 
lowed to enjoy it himself (3.3). 

Police are justified in shooting down communists (0.3). 

The score for the subject is the median scale value of the statements he
checks.
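
The scoring rule just stated can be sketched in a few lines of Python. This is
only an illustration: the scale values are those quoted above for the communism
scale, while the short labels used as dictionary keys, the function name
thurstone_score, and the hypothetical subject are assumptions introduced here.

```python
from statistics import median

# Scale values quoted in the text; the short labels are hypothetical keys.
SCALE_VALUES = {
    "solution to economic problems": 9.1,
    "give Russia twenty years": 8.4,
    "workers can hardly be blamed": 7.0,
    "evils and benefits exaggerated": 5.4,
    "not sure it solves problems": 4.7,
    "allowed to enjoy property": 3.3,
    "police justified in shooting": 0.3,
}


def thurstone_score(checked):
    """Return the median scale value of the statements a subject checks."""
    if not checked:
        raise ValueError("subject checked no statements")
    return median(SCALE_VALUES[label] for label in checked)


# Hypothetical subject who checks three statements.
example = thurstone_score([
    "give Russia twenty years",
    "workers can hardly be blamed",
    "evils and benefits exaggerated",
])
```

With an even number of checked statements, the median falls halfway between the
two middle scale values.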

In effect, the subject matches his opinion with the scale of opinions,
selecting his own position in the way that he might match a sample of gray paper
with a series of samples ranging from black to white. This, at least, is the
theory, but in practice few attitudes are as pure as the black-white
continuum. As Murphy, Murphy, and Newcomb state (23, p. 897):

“No scale can really be called a scale unless one can tell from a given attitude
that an individual will maintain every attitude falling to the right or to the left
of that point (depending on how the scale is constructed). ... If we have a
genuine scale of attitudes toward compulsory military service, it is necessary to
assume that every person favoring compulsory military service should also favor
military service for persons conscientiously opposed to such service; the former
proposition presupposes the latter. ... As a matter of fact there is every reason
to believe that none of the rather complex social attitudes . . . will ever
conform to such rigorous measurement.”

Guttman has developed a statistical theory for devising true scales, provided
that the attitudes under study admit of such treatment (15). His method has
so far had only limited application.
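
The requirement in the quotation, that endorsing one statement implies
endorsing every statement on one side of it, can be checked mechanically. The
sketch below is not Guttman's statistical procedure; it is only a hypothetical
helper (is_cumulative) that tests whether one subject's pattern of endorsements
is of the cumulative form, assuming the statements have already been ordered
from mildest to most extreme.

```python
def is_cumulative(pattern):
    """Check one subject's endorsements against the cumulative form.

    pattern lists True/False for statements ordered from mildest to most
    extreme. A cumulative pattern endorses every statement up to some point
    and none beyond it.
    """
    seen_rejection = False
    for endorsed in pattern:
        if endorsed and seen_rejection:
            return False  # endorsed a stronger statement after rejecting a milder one
        if not endorsed:
            seen_rejection = True
    return True


conforming = is_cumulative([True, True, False])
nonconforming = is_cumulative([True, False, True])
```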

3. Interpret the following responses to the Thurstone scale on “attitude toward
capital punishment.” (Under the directions most often used, subjects check a
larger number of statements than indicated here.)

a. Mr. Adams checks statements with scale values 6.2, 7.2, 7.3.

b. Mr. Brown checks statements with scale values 7.3, 8.4, 9.0, 9.4.

c. Mr. Clark checks statements with scale values 5.3, 5.4, 5.7, 7.2. 

d. Mr. Dean checks statements with scale values 2.0, 6.2, 7.2, 7.3. 

4. Prepare a set of seven statements reflecting approximately equally spaced
positions along the attitude continuum for “traveling by commercial airlines”
(as opposed to other modes of transportation).

5. Do the following constitute a true scale? “I favor no compulsory military
training”; “I favor one year of compulsory military training”; “I favor two
years of compulsory military training.” Could a man logically favor two years
of training but disapprove of one year of it?

There are three steps in Thurstone’s technique of scaling: preparation of
possible items, sorting by judges, and testing for relevance. Possible items
are obtained by collecting opinions from writers and laymen reflecting all
shades of belief. Each item is placed on a slip of paper. The entire pack of
slips is then given to a judge who places the slips in eleven piles, ranging
from one extreme to the other in approximately equal steps. This sorting is
repeated by 50 or more judges. The equal intervals in the final scale are
differences in attitude that appear equal to the judges. After a statement has
been scaled by a number of judges, it is discarded if judges disagreed
markedly in sorting it. For statements retained, the scale value is the median
position assigned by the judges.
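
The judging step can be illustrated with a short Python sketch. The text says
only that a statement is discarded when judges disagree markedly; the use of the
interquartile range as the measure of disagreement, the cutoff max_spread, and
the two sets of judgments below are assumptions made for the example.

```python
from statistics import median, quantiles


def scale_item(pile_positions, max_spread=3.0):
    """Summarize one statement from the pile (1 to 11) each judge chose.

    Returns (scale_value, spread, keep). The scale value is the median pile
    position. The spread (interquartile range) and the cutoff are assumed
    stand-ins for "judges disagreed markedly".
    """
    q1, _, q3 = quantiles(pile_positions, n=4)
    spread = q3 - q1
    return median(pile_positions), spread, spread <= max_spread


# Hypothetical sortings of two statements by ten judges.
agreed = [8, 9, 9, 9, 10, 9, 8, 9, 10, 9]
disputed = [2, 10, 5, 9, 3, 11, 6, 1, 8, 4]

agreed_result = scale_item(agreed)
disputed_result = scale_item(disputed)
```

A statement like the second, which judges scatter over nearly the whole range,
would be dropped before the relevance test is applied.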

Statements chosen are now placed in a preliminary check list. This list is
given to subjects as a means of checking the relevance of items. The method
used resembles the internal-consistency test described on p. 78. The scores
of those who check a given item are tabulated. If these persons generally
have the attitude represented by the item, it is a good item. But if a favorable
statement about communism is checked by many people whose general
attitudes are unfavorable, the item does not measure the same factor as the rest
of the test. The statement, “Communism has done many valuable things for
the Russian people,” might be checked by some who oppose communism.
The statement perhaps measures “attitude toward communism for the
Russians,” where the rest of the scale emphasizes “attitude toward communism
for this country.” Items which are frequently checked by people whose
attitude score is far from the scale value of the item are discarded. The final
scale consists of items which have passed the tests for ambiguity and
relevancy. Final items are selected to represent the entire range of attitudes.
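
A rough sketch of the relevance check follows. The text gives no numerical
rule, so the thresholds max_gap (how far a score must lie from the item's scale
value to count as "far") and max_share (how many such endorsers are tolerated)
are assumptions, as are the item labels and the five hypothetical subjects.

```python
from statistics import median


def relevance_report(scale_values, checked_by_subject, max_gap=2.0, max_share=0.25):
    """Flag items often checked by subjects whose score is far from the item.

    scale_values maps item -> scale value; checked_by_subject is a list of
    non-empty sets of items, one set per subject. Returns item -> True (keep),
    False (discard), or None (nobody checked it).
    """
    scores = [median(scale_values[i] for i in checked) for checked in checked_by_subject]
    report = {}
    for item, value in scale_values.items():
        endorsers = [s for s, checked in zip(scores, checked_by_subject) if item in checked]
        if not endorsers:
            report[item] = None
            continue
        far = sum(abs(s - value) > max_gap for s in endorsers)
        report[item] = far / len(endorsers) <= max_share
    return report


# Hypothetical trial data: item "X" is favorable in wording but is also
# checked by two subjects whose remaining choices are unfavorable.
values = {"A": 9.0, "B": 8.0, "C": 2.0, "D": 3.0, "X": 8.5}
subjects = [{"A", "B", "X"}, {"A", "B"}, {"C", "D", "X"}, {"C", "D", "X"}, {"C", "D"}]
report = relevance_report(values, subjects)
```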

Remmers has devised a generalized scale to avoid the labor of testing items
each time a new scale is desired. He constructs a single scale for measuring
“attitude toward any race” or “attitude toward any institution.” Such
statements as “This race has contributed a great deal to human progress,” or “This
race is no more deceitful than other people” can be scaled as usual. The
resulting scale, with appropriate directions, measures attitude toward
Germans, Chinese, or any other group. Substantially the same results are
obtained with Remmers’ scales as with scales constructed about a specific topic
(26). Some of the generalized scales lead to absurdities when applied
uncritically. One reviewer points out that the statement, “This institution does
not consider individual differences,” becomes meaningless when the
institution being rated is marriage (21, p. 305).

Thurstone-type scales are reliable, with coefficients of equivalence
generally near .85. This is quite high for such brief tests. Subjects are consistent
in their self-reports. This consistency is present even when the specific
statements are changed from form to form.

Stability of scores is not so high. Corey found that college students,
retested after one year, changed in score considerably. The correlations
between first and second test ranged from .28 to .61 (median .49). Coefficients
of equivalence (split-half) for the first testing were between .35 and .82
(median .71) (8).

6. The split-half coefficients reported by Corey are much smaller than those
found by other workers for the same scales. What factors might account for
this?

7. If a test validly measures attitudes, should scores obtained a year apart have 
high correlations? 

The Likert Technique 

The Likert procedure for constructing attitude tests is increasing in use.
The Likert-type scale resembles simple questionnaires, except that more
refined techniques of item selection improve the instrument (19). Each scale
is a series of statements, ranging from as few as 18 items to as many as 200.
Each statement in a Likert scale is either definitely favorable or definitely
unfavorable to the object of the scale. The subject indicates his reaction to
each statement, usually on a five-point scale: strongly agree, agree,
undecided, disagree, strongly disagree. These answers are credited 5, 4, 3, 2,
and 1, respectively, for favorable statements; and 1, 2, 3, 4, and 5,
respectively, for unfavorable statements. A favorable attitude is shown in a high
score. Scores are interpreted on a relative basis since there is no absolute
system of units such as Thurstone’s.
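
The crediting rule can be written out as a brief Python sketch. The credits of
5 through 1 and their reversal for unfavorable statements come from the text;
the function name likert_score, the list of favorable flags, and the sample
answers are hypothetical.

```python
# Credits for a favorable statement, as given in the text.
RESPONSE_CREDIT = {"SA": 5, "A": 4, "U": 3, "D": 2, "SD": 1}


def likert_score(responses, favorable):
    """Sum the item credits, reversing the credit for unfavorable statements.

    responses holds one of SA, A, U, D, SD per item; favorable holds True for
    statements favorable to the attitude object and False otherwise.
    """
    if len(responses) != len(favorable):
        raise ValueError("one favorable/unfavorable flag is needed per response")
    total = 0
    for answer, is_favorable in zip(responses, favorable):
        credit = RESPONSE_CREDIT[answer]
        total += credit if is_favorable else 6 - credit  # 1<->5, 2<->4, 3 stays
    return total


# Hypothetical four-item scale: items 2 and 4 are unfavorable statements.
score = likert_score(["SA", "D", "U", "SD"], [True, False, True, False])
```

Here the subject earns 5, 4, 3, and 5 credits on the four items, so that
disagreement with an unfavorable statement raises the score just as agreement
with a favorable one does.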

The first step in the Likert procedure is the collection of possible
statements. These are administered as a trial test to many subjects. These papers
are scored and each item is correlated with the total test. If an item is not
correlated with the total scale, it is discarded. This internal-consistency
procedure eliminates ambiguous items and those not of the same type as the
rest of the scale.
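
As a sketch of this item analysis, the code below correlates each item with the
total score, assuming the answers have already been converted to keyed credits
of 1 to 5. The trial data are invented, and the text does not state how low a
correlation must be before an item is dropped, so no cutoff is applied here.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation; assumes neither list is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_total_correlations(credits):
    """credits[s][i] is the keyed credit of subject s on item i."""
    totals = [sum(row) for row in credits]
    n_items = len(credits[0])
    return [pearson([row[i] for row in credits], totals) for i in range(n_items)]


# Hypothetical trial test: five subjects, three items.
trial = [
    [5, 4, 3],
    [4, 5, 1],
    [2, 2, 5],
    [1, 1, 2],
    [3, 3, 4],
]
correlations = item_total_correlations(trial)
```

In these invented data the first two items rise and fall with the total, while
the third is nearly unrelated to it and would be a candidate for removal.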

One widely known Likert-pattern scale is the Minnesota Survey of Opinion
(which is not an opinion survey of the type described on p. 368). This is a
set of six scales devised to study morale during the depression of the 1930’s
(28). Each statement in the scale reflects an optimistic or pessimistic
outlook, the separate sections dealing with morale, inferiority feelings, family
adjustment, attitude toward the law, economic conservatism, and attitude
toward education. Items from the respective scales include:

SA A U D SD It is difficult to think clearly these days.

SA A U D SD It is difficult to say the right thing at the right time.

SA A U D SD A man should be willing to sacrifice everything for his family.

SA A U D SD A person should obey only those laws which seem reasonable.

SA A U D SD Legislatures are too ready to pass laws to curb business freedom.

SA A U D SD A man can learn more by working four years than by going to
high school.

The coefficients of equivalence of the scales ranged from .79 to .85. Darley
found that the coefficients of stability were lower, around .60 when the tests
were given about nine months apart (10).

Several sets of tests have dealt with conservative-liberal opinions on
prominent issues. The Scale of Beliefs of the Progressive Education Association
measures high-school students’ reactions to issues dealing with labor,
economic policies, race, nationalism, militarism, and democratic practices (29,
pp. 203-244). These scales had coefficients of equivalence
(Kuder-Richardson) ranging from .79 to .86 for high-school groups, with a coefficient
of .95 for the total Liberalism score. The Scale of Beliefs includes a device for
measuring the consistency of attitudes. Each liberal statement in the test
deals with the same issue as one conservative statement elsewhere in the
test. One such pair is:

A U D We should buy foreign products only when American goods are not
available.

A U D In this day of economic interdependence, it is desirable that we buy
goods from other nations.

The Consistency score is the total number of times the student gives the same
answer (i.e., consistently liberal or consistently conservative) on both items
of a pair. A low Consistency score is found when responses are influenced by
plausibility and appeal of verbal symbols rather than by well-formulated
and consistent beliefs. Like other investigations, work with the Scale of
Beliefs indicates that conservatism on one problem is highly correlated with
conservatism regarding other issues.
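
The Consistency score lends itself to a small Python sketch. It assumes that a
pair counts as consistent only when the student agrees with one statement and
disagrees with its opposite; how the published key treats “undecided” answers
is not stated in the text, so here they simply earn no credit. The function
name and the sample answers are hypothetical.

```python
def consistency_score(pairs):
    """Count pairs answered in a consistently liberal or conservative way.

    Each pair is (answer_to_liberal_statement, answer_to_conservative_statement)
    with answers "A", "U", or "D". Agreeing with one and disagreeing with the
    other is consistent; any pair involving "U" is assumed to earn no credit.
    """
    consistent = {("A", "D"), ("D", "A")}
    return sum(1 for pair in pairs if pair in consistent)


# Hypothetical student answering four liberal-conservative pairs.
answers = [("A", "D"), ("D", "A"), ("A", "A"), ("U", "D")]
score = consistency_score(answers)
```

The third pair illustrates the inconsistency the device is meant to catch: the
student accepts both the statement and its opposite.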

Comparison of Testing Techniques 

Several studies have tried to establish the comparative merits of the
Thurstone and Likert techniques. Since corresponding scales of the two
types have high correlations, there is little difference except in convenience.
For most purposes, it matters little which method is used. In one study,
comparable scales were constructed by both techniques. The correlation
between them was high, which shows them to be practically
interchangeable (12).

The time required to construct a Thurstone-type scale is a drawback.
Careful construction of a Likert-type scale requires about half the time needed for
the Thurstone procedure (12).

The two scales have established comparable reliabilities. Likert and others
found that Thurstone scales yielded substantially higher reliabilities when
rescored by the Likert method. For several scales and several groups, the
Thurstone scales had split-half coefficients from .42 to .87, but when rescored
with the Likert procedure the coefficients ranged from .67 to .95 (20). There
is no evidence, however, that this higher reliability produces higher validity.

The factors which make for invalid self-report are equally present in both
tests. The Likert method permits response sets to influence the score, which
might lower validity. Where directions for the Thurstone scale require one
to check, say, the six statements with which he most agrees, no response sets
affect the score.

The Likert test is more diagnostic than the Thurstone method. Every
subject responds to every item, so that item analysis gives a picture of reactions
to specific issues. In scales measuring morale, several investigators have made
item analyses to show just what hopes and fears were most common in
particular groups (6, 9).

8. What advantages and disadvantages would you expect from directions, on a
Thurstone scale, to check each opinion statement “yes” or “no,” rather than
marking just the six statements one most accepts?

9. List the response sets to which the Likert method is susceptible.

10. Do reliability coefficients for attitude scales appear to be high enough for the 
uses to which they are most often put? 

Other Techniques 

Some instruments have sought to get beneath verbal abstractions by asking
the subject what he would do in a particular case (24). Thus, in 1940, one
investigation asked subjects what they would do to encourage war if Hawaii
or California were threatened by an enemy (1). These predictions were
thought to be a better reflection of tendency to act than mere expressions for
or against war in the abstract.

Fig. 79. Photographs used to measure prejudice against Negroes (16).

Horowitz departed from the purely verbal method by presenting
photographs of pleasant boys, both white and Negro. He used such directions as:
“Pick out the one you like best, next best, etc. Show me all those that you
want to sit next to you on a street car.” The test appears to obtain more
discriminating reactions from boys than verbal instruments where they respond
to general symbols (16). Expressions of approval or disapproval of pictured
incidents (race riots, fights between strikers and police) were used in
another study to measure reactions to real events rather than verbal labels (22).

Bogardus’ “social distance” method has been used to study race
preferences. The subject shows the distance at which he keeps the race by stating
whether he approves their entering his country, entering his vocation, and
so on up the scale to marrying into his family (4). The method is primarily
of value for measuring group differences rather than for obtaining reliable
individual scores.

A particularly sound technique, the paired-comparison method, has re- 
ceived less use than it deserves. The subject is confronted with paired terms, 
such as Germans-Italians, French-Russians, Germans-Swedes, etc., each term 
being paired with another term. The subject marks the preferred member of 
each pair. From these judgments, one can reconstruct with great reliability 
the rank order of nationality preferences and can determine a scale value or 
acceptance score for each group (32). Typical results from the method are 
shown in Figure 80. Paired comparisons have the advantages of being pre- 
cise, free from response sets, and relatively hard to falsify, since each item 
demands a response. The weaknesses of the method are that it is time- 
consuming, and it does not measure what each person believes about each 
stimulus, as it measures relative preferences only. For example, if a certain 
course made men more tolerant of all races, the paired-comparison technique 
might show no change in ranks of particular races. Suitable adaptation of 
the method might overcome this difficulty. The possibility of studying rela- 
tions between competing attitudes seems promising. Attitude to universal 
military training might be studied more fruitfully than the usual scale per- 
mits, if subjects indicated their choice between u.m.t. and lowered taxes, 
u.m.t. and an additional year of college for qualified youth, u.m.t. and an 
expanded standing army, etc. 
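
The reconstruction of a rank order from paired choices can be sketched as
follows. The sketch uses the simplest possible acceptance score, the proportion
of comparisons in which a term is preferred; this is an assumption made for
illustration and is not the scaling procedure of reference 32. The terms P, Q,
and R and the two subjects' choices are placeholders.

```python
from collections import Counter
from itertools import combinations


def acceptance_scores(terms, choices):
    """Proportion of its comparisons in which each term is preferred.

    choices holds one dict per subject mapping frozenset({a, b}) to the
    preferred member of that pair; every pair of terms must be answered.
    """
    wins = Counter()
    for subject in choices:
        for pair in combinations(terms, 2):
            wins[subject[frozenset(pair)]] += 1
    comparisons_per_term = len(choices) * (len(terms) - 1)
    return {term: wins[term] / comparisons_per_term for term in terms}


# Placeholder terms and two hypothetical subjects.
terms = ["P", "Q", "R"]
choices = [
    {frozenset({"P", "Q"}): "P", frozenset({"P", "R"}): "P", frozenset({"Q", "R"}): "Q"},
    {frozenset({"P", "Q"}): "P", frozenset({"P", "R"}): "R", frozenset({"Q", "R"}): "Q"},
]
scores = acceptance_scores(terms, choices)
ranking = sorted(terms, key=lambda term: scores[term], reverse=True)
```

Because every score is a share of forced choices, the scores show only relative
standing. If a subject came to regard all three terms more favorably, the
choices and hence the scores could remain exactly the same, which is the
weakness noted above.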

PROBLEMS OF ATTITUDE TESTING 
Validity 

Attitude tests have been severely criticized because they have been used
without their validity having been established (21). The failure to
demonstrate that they are valid measures of belief is in part due to the difficulty of
finding criteria. Attitude tests were designed, not to replace less convenient
ways of measuring attitudes, but to fill the need for any sort of measuring
device. We know little about a man’s attitude except what he tells us, so that
there is no sure way of comparing his self-report, his “public” attitude, with
his true private beliefs.

Some investigators have limited their purpose to determining the
subject’s publicly verbalized opinions. If that is the purpose of measurement, the
self-report test has, by definition, a high degree of validity.

To determine whether verbal expressions of attitude are honest and “real,”
it is necessary to compare them with outside criteria. That they have some
relation to behavior is shown by comparison of test scores of groups having
known attitudes. Thurstone and Chave determined the test scores of church
members and non-church members, with the results shown in Table 65. If
what people do reflects their beliefs, the attitude test measures belief in the
church. But even if a test is valid in comparing groups, scores of individuals
may not all be valid. Just as there are hypocrites who profess socially
approved opinions, there are those who join groups even though they do not
believe in the objectives of the groups. Such a person might praise
prohibition and join a temperance organization, while privately drinking, and
owning distillery stock.

11. Some church members, according to Thurstone’s data, have less favorable
attitudes toward the church than some non-churchgoers. Is this possible if
statements on the attitude scale represent true private attitudes?

12. Outline a procedure for testing the validity of a scale to measure attitude 
toward college fraternities, by comparing groups whose probable attitude is 
known. 

13. Do the same for a scale intended to measure attitude toward chain stores. 


Table 65. Percentage Distributions of Scores on Thurstone-Chave Scale for
Selected Groups (33)

                                                 Do Not
                                    Attend       Attend       Active       Not Active
                       Divinity     Church       Church       Church       Church
Attitude       Score   Students     Frequently   Frequently   Members      Members
                       (N = 103)    (N = 678)    (N = 692)    (N = 581)    (N = 781)

Favorable      0-1                  —                         —
               1-2     17           18           1            18           2
               2-3     50           42           6            42           10
               3-4     23           23           11           20           14
               4-5     5            10           14           10           14
Neutral        5-6     4            4            17           6            14
               6-7     1            2            21           3            18
               7-8     —            1            14           1            15
               8-9     —            1            12           —            11
               9-10    —            —            3                         3
Antagonistic   10-11   —            —            —            —            —


Stouffer validated a Thurstone-type scale on prohibition in 1929 by obtain- 
ing 238 autobiographies from college students. The case history included 
self-reports on significant experiences with drinking. The agreement between 
test scores and judges’ ratings of attitude from the case histories was very 
high, the correlation being .81 (.86 when corrected for unreliability of 
judges) (31). 
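
The two figures quoted above can be related by a short calculation. The sketch
assumes that the correction was the usual correction for attenuation applied to
the criterion alone, that is, the observed correlation divided by the square
root of the reliability of the judges' ratings. Under that assumption, which
the text does not confirm, the implied reliability of the judges can be
back-computed; the result is an inference, not a figure reported in the source.

```python
from math import sqrt


def correct_for_criterion_unreliability(r_observed, criterion_reliability):
    """Correction for attenuation applied to the criterion only (assumed form)."""
    return r_observed / sqrt(criterion_reliability)


def implied_criterion_reliability(r_observed, r_corrected):
    """Reliability the criterion must have had to yield r_corrected."""
    return (r_observed / r_corrected) ** 2


# .81 observed and .86 corrected are the values quoted in the text.
implied = implied_criterion_reliability(0.81, 0.86)  # roughly .89
round_trip = correct_for_criterion_unreliability(0.81, implied)
```

A judges' reliability near .89 would account for the small difference between
the two coefficients.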

The Scale of Beliefs was validated by comparing it with teachers’
judgments of pupils. Agreement was claimed in 90 percent of the cases (29, p.
227). Similar agreement was reported between the test and careful studies
of 18 pupils through interviews and records of points of view expressed in
classwork. Evidently the tests do represent fairly the public, verbal opinions
of students.

A more penetrating attack upon the validity of attitude expressions studies
the behavior of subjects. If one’s attitude is his tendency to react favorably
or unfavorably to an attitude object, a valid test should predict his reactions.
Reactions are difficult to observe. They, too, are influenced by the desire to
make a good impression on others. For many attitude objects, there are few
daily circumstances where one can act out his attitudes.

One of the successful attempts to measure actual behavior was Corey’s
study of attitude toward cheating (7). A college psychology class took a
Thurstone-type scale regarding cheating, which was unsigned but secretly
identified by the instructor. The group stated that they approved of honesty.
Actual behavior was then studied experimentally. A regular course
examination was given each Friday. On Monday, papers were distributed to their
owners to be graded in class. There was ample opportunity to change
answers while scoring. Cheating was measured by comparing the score after
class correction with a score recorded secretly by the instructor before
returning the papers. Under these conditions cheating was found, despite the
professed attitude. There was a correlation of only .02 between attitude and
actual amount of cheating.

LaPiere’s findings on verbal attitudes and behavior were similarly
negative. Traveling with two Chinese companions, he found them refused service
in hotels, restaurants, and auto camps only once in 251 times. Six months
later he questioned each establishment by mail as to its policy. Over 90
percent of the respondents said that they would not serve Chinese. Says LaPiere,
“Those factors which most influenced the behavior of others toward the
Chinese had nothing to do with race. Quality and condition of clothing,
appearance of baggage, cleanliness and neatness were far more significant for
person to person reaction in the situations I was studying than skin
pigmentation, straight black hair, slanting eyes, and flat noses” (17, p. 232).

These findings indicate that it is necessary to be skeptical of self-report
attitude scales. They are useful only when one emphasizes that they
represent publicly stated reactions to symbols. Such opinions are often important,
especially if a group or individual states socially disapproved opinions. For
example, a high-school teacher should try to increase the tolerance of a class
professing intolerance. On the other hand, one could not conclude that a
group giving “tolerant” answers was free from prejudice. An interviewer finds
that uneducated adults report more premarital intercourse than educated
adults; this does not prove that their conduct is different privately, but it
does show a difference in the standards they apply in deciding what to tell.
Shifts in overt verbal attitudes are important. If racial prejudice scores rise,
this may be a mere bringing to the surface of antagonisms already present.
But the fact that this antagonism is expressed or repressed shows significant
social change.

Attitude tests are most likely to be valid when the subject has no motive
to conceal his attitude. Attitude objects toward which there is no “right”
behavior are more likely to elicit truthful self-report than those which
involve prestige and group approval. The conditions of testing may affect
responses, as shown by an experiment of Robinson and Rohde (27). They used
the single-question interview, sometimes with non-Jewish and sometimes
with identifiably Jewish interviewers. To the question, “Do you think the
Jews have too much power?” percentage of “yes” responses was as follows:


With non-Jewish interviewer 24.3% 

With Jewish interviewer 15.6% 

With Jewish interviewer, using Jewish name 5.8% 

With non-Jewish interviewer, non-Jewish name 21.4% 


A significant limitation peculiar to attitude tests is that they readily
become obsolete. Social issues change, and opinions that are radical in one
generation become conventional in another. Statements may take on new
meaning with changing events, even when the words remain unaltered. Thus,
reactions to a scale regarding censorship no doubt change when public
controversy centers around “subversive” books and movies, rather than around
“obscenity.” Certainly an old attitude scale must be scrutinized with care
before use. Statements must be checked for obsoleteness, and the key must
be checked to make sure that opinions once thought “liberal” are still so
classifiable. Even with the Thurstone judging procedure, the scale value of
items on attitude toward war changed markedly from 1930 to 1940 (13).

14. In Corey’s study, what factors prevent one from assuming that amount of 
change in score is a completely valid measure of attitude on cheating? 

15. What conclusions are warranted by each of the following findings? 

a. After a course in philosophy of education, a class of prospective teachers 
is much more favorable to “democratic classroom practices” than at the beginning
of the course.

b. Pupils in school A have a higher average score than pupils in school B,
on a scale measuring the extent to which pupils feel that they are treated 
fairly by their teachers. 

16. Would you expect true or false self-reports on a signed attitude test dealing 
with each of the following topics? 

a. Child-rearing practices. 

b. Tax reduction. 

c. Racial segregation. 

Intensity of Attitude 

Attitude scales do not show the intensity with which people hold a favorable
or unfavorable opinion. Three people may show the same critical attitude
toward communism. One may have chosen that opinion after little
thinking and information and be open to conviction in either direction by 
any new argument. The second person may have reached the same opinion 
after careful study and may be difficult to convert to any other attitude. The
third opinion may be based on conditioning and intense emotion. The only 
device so far used to measure intensity has been to ask subjects how strongly
they believe a given opinion. There is no evidence that such statements are
valid. 

Neutral scores on scales have been difficult to interpret. If a person's score 
falls midway between the two extremes he may be unacquainted with the
attitude object, he may be genuinely indifferent, or he may have conflicting 
feelings for and against which cancel out on the scale. As in personality tests, 
extreme scores are likely to be meaningful, whereas all we can say about a 
person falling near the middle of the scale is that his beliefs are not consist- 
ently extreme. 

Complexity of Attitude Objects 

A persistent problem in attitude testing is definition of the attitude object.
The scale is best considered as measuring response to a verbal symbol, rather
than to what the symbol stands for. Stagner tested attitude toward fascist
practices in a test where the word “fascism” was not mentioned. Persons 
quite unfavorable to the symbol “fascism” were willing to endorse principles 
enunciated by Hitler and Mussolini. Stagner therefore considers the usual 
scale, in which the attitude object is overtly named, as merely a measure of
verbal conditionings to the stereotyped meaning of the word used (30).

In real life, on the contrary, behavior is not based on verbal labels, as this
comment shows (18): “The question ‘What would you do if you met your
friend Wang Chi, the Chinese, dressed in his best and beaming with good
humor, on the corner of Sacramento Street and Grant Avenue, the day before 
New Year, when you were alone and in no hurry?’ might constitute a sym- 
bolization of an actual situation. The question, typical of ‘attitudinal’ ques- 
tionnaires, ‘Do you consider that Chinese make good American citizens?’ 
presents a vague symbolization of an exceedingly vague situational abstrac- 
tion. What is a Chinese? What is an American citizen? What is good?” 

Attitudes toward general concepts may not extend to specific instances. 
Inconsistency of attitudes is shown when those who orate about “freedom of 
speech” attempt to restrict certain groups which they consider subversive. 
Attitude scales rarely measure attitudes toward specific objects; rather, they
concern “education,” “the church,” or “labor unions.” Yet one is likely to react
differently toward different churches or different unions. One can, in Thurstone’s
scale, react to the church in terms of its influence on society, in terms
of the logic of its theology, in terms of its satisfyingness as a social center, and
so on. A person may approve the social influence of the church and enjoy it
as a social meeting place without himself believing in religion. A deeply
religious person might have doubts about some social practices of the church. 
It would be unwieldy to construct separate scales for each aspect of an 
attitude object. The justification for grouping many aspects is the statistical 
tendency for attitudes toward the various aspects to be similar. One cannot 
reason from a general attitude that an individual holds the same opinion regarding
all phases of the attitude object.

Fig. 80. Ratings of crimes before and after seeing the movie “Street of Chance” (25).
[Crimes rated, in the order listed in the figure: bank robber, gangster, kidnapper,
smuggler, quack doctor, bootlegger, gambler, pickpocket, petty thief, drunkard,
speeder, beggar, tramp.]


In studies of morale in particular, one must be wary of placing too much 
emphasis on general level of score. Such composite scores conceal variations 
in morale regarding specific aspects of the situation. A worker may have 
quite different attitudes toward the work itself, his foreman, opportunity 
for promotion, management, etc. (14). In wartime, morale regarding the
chances of victory may be high; at the same time morale may be low regard- 
ing the effect of the war on the standard of living, or one’s danger as a civilian 
from enemy bombing (19). Diagnostic morale scales are needed for full in- 
sight into significant grievances, anxieties, and sources of satisfaction. 

17. From the statements listed on page 369, what aspects are included in the 
scale measuring attitude toward communism?

18. Show that a person could, without contradicting himself, mark statements
having quite different scale values in the communism scale. 

19. Show that “attitude toward compulsory military training” may be consid- 
ered as a composite of reactions to many more specific attitude objects. 

APPLICATIONS OF ATTITUDE TESTS 

If attitude tests were completely valid indices of belief, they would have 
great practical significance. A typical practical question where a measure of 
true private attitudes would have untold value is in monitoring the new
education program in Germany and Japan to determine if democracy is 
really being learned. In the absence of unqualified validity, many investigators
have nevertheless applied attitude scales to research on the professed
attitudes of pupils, voters, and workers.

Fig. 81. Attitudes toward Negroes before and after seeing the movie
“Birth of a Nation” (25). [Horizontal axis: attitude toward the Negro, from
unfavorable to favorable.]

Much attention has been given to determining how attitudes can be 
changed. One series of studies investigated the influence of motion pictures 
upon attitudes of youth (25). A typical finding was that after seeing the
motion picture “Street of Chance” students ranked gambling a more serious
crime, as shown in Figure 80. The effect of seeing “Birth of a Nation” on
attitudes toward the Negro was also marked (Fig. 81). Other studies of attitudes
have proved definitely that appropriate educational procedures produce
marked shifts (23, pp. 946 ff.).

Figs. 80 and 81 from Peterson and Thurstone, Motion Pictures and the Social
Attitudes of Children. Copyright, 1933, by The Macmillan Company and used with
their permission.

Fig. 82. Profile chart for a navigation instructor (5, p. 100). [The chart plots the
instructor's standing in relation to the average instructor rating, on a scale
running from 100 down to 70, for each item surveyed: II. Knowledge of subject
matter; III. Organization of lectures; IV. Clarity of presentation of subject;
V. Effectiveness in use of training aids; VI. Ability to arouse and maintain
interest; VII. Completeness of answers to student questions; VIII. Understanding
students' ability to absorb and use material; IX. Voice quality; X. Ability to
state thoughts fluently; XI. Adequacy of explaining new vocabulary and technical
words; XII. Control of class; XIII. Developing class interest; XIV. Attitude of
helpfulness toward cadets; XV. General effectiveness of instructor in putting
subject across.]

In schools, attitude scales have been used for evaluation, particularly in
the field of citizenship education. A major purpose of teaching is to form
attitudes: appreciation of literature, scientific open-mindedness toward
debatable questions, or interracial tolerance. The school must measure these
attitudes if it wishes to be sure it is accomplishing its purpose. Testing of 
attitudes draws attention to individuals with undesired attitudes, so that spe- 
cial effort can be made to change them. Attitude tests also direct the pupil’s 
attention to the fact that the school is concerned with more than factual 
learnings. 

Attitude tests are effectively used in civilian and military schools to determine
students’ opinions on the training. The AAF navigation training
program devised a Survey of Student Opinions covering many aspects of
the training. Scores could be compared from instructor to instructor and from
month to month, calling attention to weak points in the program (5, pp. 97,
168 ff.). Such information can be very constructive when used as a basis for
self-improvement but is damaging to instructor morale when it brings criticism
from his superiors or a threat of discharge.

In industry, scales measuring attitude toward the job have been applied 
to compare plants, departments, and groups of workers (3). Items from a
scale by Bergen include the following (scale values in parentheses) (2): 

On the whole, the company treats us about as well as we deserve (6.6). 

In my job I don’t get any chance to use my experience (3.2). 

I think the company’s policy is to pay employees just as little as it can get 
away with (0.8). 

Such surveys are especially helpful if they are made the basis for correcting 
sore spots. Item analysis of the questionnaire is a starting point toward noting 
grievances, and interviewing employees often gives further insight into 
causes of dissatisfaction. Morale-measuring techniques are most helpful 
when representatives of the workers have a hand in planning the investi- 
gation. 

Attitude scales have been adapted to studies of morale. The effect of the 
depression upon the confidence and optimism of adults was studied by Rund- 
quist and Sletto (28). Similar measures were used to study the development 
of morale during wartime by a number of investigators. Attempts were made 
to identify the background factors related to good and poor morale. The gen- 
eral finding was that age, sex, occupation, and racial background were less 
effective in determining morale than the emotional adjustment of the person 
to his individual hardships (6, 9).

CHAPTER 18


Observation of Behavior in Normal 
Situations 


The preceding chapter concluded the discussion of self-report techniques. In 
the present chapter and the two following, attention is shifted to determination
of typical behavior by direct observation. Observation methods can be
highly valid when proper precautions are taken, but adequate observation is
difficult and laborious. In the present chapter, we consider methods of obtaining
data by observing the subject in such natural situations as school,
playground, and shop. 

Observation in normal or field conditions and observations in test situations,
to be considered later, are distinguished in terms of whether we provide
fixed test stimuli to which the subject responds. Field observations are
often relied on merely because data of this type are readily obtainable. There 
are times, however, when they are used in preference to more standardized 
procedures and test stimuli. Many investigators feel that it is impossible to 
know personality unless we watch the subject react to the conditions that are 
most significant for him. Yet different stimuli provoke the most revealing 
behavior for different people. Test situations, unless unusually fortunate in 
their choice of stimuli, are not as likely to demonstrate the important behavior
patterns of all subjects as are the normal conditions under which they
live. The difficulty lies in seeing enough of the person’s normal behavior, and 
of obtaining dependable records of that behavior. 

DIFFICULTIES OF FIELD OBSERVATION 
Sampling Error 

Whenever one wishes to know the typical behavior of an employee, a
student, or a patient, the most direct way to find out is to observe him in
normal situations. If he does not know that we are watching him, we obtain
a truthful picture of his characteristics, limited only by our skill as observers.
This is our usual basis for judging associates and friends. Judgments based
upon observation, however, are likely to be completely untrustworthy owing
to sampling errors and observer errors.

To know the “typical” behavior of an individual, it is necessary to know 
how he characteristically acts in a particular situation. But situations change 
from day to day and from moment to moment. If we observe the attentiveness
of an employee before lunch, we may get a different impression from
the one we get if we observe in midafternoon. If we observe cheerfulness or 
politeness when he is worried, our impression may be unfair. The only way 
to be even moderately certain that the behavior is typical is to study the 
subject on many occasions of the type about which we wish to generalize. 
This is usually expensive, and one cannot acquire a fair picture in a short
time. In practice, it is usual to compromise between perfect sampling and
economy.

A second difficulty of inference about individual differences arises from
the fact that one can never observe two individuals in the same situation.
Even when the situation is externally similar, previous conditions cause peo- 
ple to behave differently. Two children, side by side in a classroom, are ex- 
posed to much the same stimuli. When Jimmy fidgets more than John, one is 
likely to infer that Jimmy is “restless,” “nervous,” or “jumpy.” If that impression
is confirmed by repeated observations, one might be tempted to consider
this difference in behavior fundamental. But if Jimmy usually comes
to school without breakfast, if he expects to be criticized by the teacher for 
poor work, or if he is large for the chairs provided, the difference in activity 
may tell nothing about the boys’ basic restlessness. In fact, if conditions were 
reversed, John might be more restless than Jimmy is now. At best, comparative
observations show how different people act under their present conditions
but do not guarantee that these differences would persist if background
conditions changed.

Observer Error 

Whenever a person observes an event, he notices some happenings and ig- 
nores others. This is a necessary difficulty, since any activity has too many 
aspects for the human mind to attend to at once. Especially in social situations,
where behavior observations are most needed, the complexity of interaction
prevents exhaustive reporting. If errors in observing were merely
random omissions, they would be unimportant. But observers make systematic
errors, overemphasizing some types of happenings and failing to
report other types.

Viewing the identical scene, observers give widely different reports. The 
following reports were written by four observers, each of whom saw the
same motion-picture scene of about ten minutes’ duration (from the film
“This Is Robert”). The scene was shown twice without sound, and they were
directed to note everything they could about one boy, Robert. The scene, in
a first-grade classroom, showed several activities which revealed much of 
Robert’s personality. Numbers in these accounts, referring to scenes in the 
film, were inserted in order to permit comparison of different accounts. Pa- 
rentheses are a device used by the observers to set apart any inferences or 
interpretations.

Observer A: (2) Robert reads word by word, using finger to follow place. (4)
Observes girl in box with much preoccupation. (5) During singing, he in general 
doesn’t participate too actively. Interest is part of time centered elsewhere. Ap- 
pears to respond most actively to sections of song involving action. Has tendency 
for seemingly meaningless movement. Twitching of fingers, aimless thrusts with 
arms. 

Observer B: (1) Looked at camera upon entering (seemed perplexed and 
interested). Smiled at camera. (2) Reads (with apparent interest and with a fair 
degree of facility). (3) Active in roughhouse play with girls. (4) Upon being 
kicked (unintentionally) by one girl he responded (angrily). (5) Talked with
girl sitting next to him between singing periods. Participated in singing. (At times
appeared enthusiastic.) Didn’t always sing with others. (6) Participated in a
dispute in a game with others (appeared to stand up for his own rights). Aggressive
behavior toward another boy. Turned pockets inside out while talking
to teacher and other students. (7) Put on overshoes without assistance. Climbed 
to top of ladder rungs. Tried to get rung which was occupied by a girl but since 
she didn’t give in, contented himself with another place. 

Observer C: (1) Smiles into camera (curious). When group breaks up, he 
makes nervous gestures, throws arm out into air. (2) Attention to reading lesson. 
Reads with serious look on his face, has to use line marker. (3) Chases girls,
teases. (4) Girl kicks when he puts hand on her leg. Robert makes face at her. 
(5) Singing. Sits with mouth open, knocks knees together, scratches leg, puts 
fingers in mouth (seems to have several nervous habits, though not emotionally 
overwrought or self-conscious). (6) In a dispute over parchesi, he stands up for 
his rights. (7) Short dispute because he wants rung on jungle gym. 

Observer D: (2) Uses guide to follow words, reads slowly, fairly forced and
with careful formation of sounds (perhaps unsure of self and fearful of mistakes).
(3) Perhaps slightly aggressive as evidence by pushing younger child to side 
when moving from a position to another. Plays with other children with obvious 
enjoyment, smiles, runs, seems especially associated with girls. This is noticeable 
in games and in seating in singing. (5) Takes little interest in singing, fidgets, 
moves hands and legs (perhaps shy and nervous). Seems in song to be unfamiliar
with words of main part, and shows disinterest by fidgeting and twisting around. 
Not until chorus is reached does he pick up interest. His especial friend seems 
to be a particular girl, as he is always seated by her. 

1. What do you think really happened in scene (4)? Which observer came closest 
to adequate reporting of it? 

2. Which of the numbered scenes appears to give the most significant informa- 
tion about Robert? How many of the observers reported that information? 

3. To what extent have the observers succeeded in identifying and marking all
their judgments and hypotheses? 

4. Do the observers ever disagree, or are the differences entirely due to omissions
and oversights? 

All observers tend to be biased witnesses. The foreman, for example, is
more likely to note an error made by a worker of whom he is critical than the
same error made by one of his favorites. It is usual for us to form impressions 
of those we meet, so that we would willingly describe them if asked to. Hav- 
ing made a judgment, we are then prone to note and remember those events 
which support that judgment.

Every observer is more sensitive to some types of behavior than others.
How does he regard nail biting, failure to look one in the eye, or profanity? If 
he considers these significant, he will be careful to note them and base his 
impression on them. In the same situation, another observer might give 
greatest attention to voice modulation, careful use of grammar, or friendli- 
ness of conversation. Ideally, an observer would base his impression on every 
revealing act, but when he is looking for one thing, he necessarily overlooks 
something else. One famous psychologist used to demonstrate this to his 
classes by requiring them to attempt to report “everything” he did in a short
period. He then ostentatiously lifted a piece of apparatus above his head and 
manipulated it. The class, observing, always failed to report that during this 
process his “idle” hand unbuttoned his coat, removed his watch and laid it 
on the table, and took a cigarette from his case. 

Observers interpret what they see. If they recorded only objective facts, 
others studying the data might reach quite different interpretations. Since 
people always try to give meanings to what they see, they find it difficult to
avoid assuming meanings which may be untrue. Some time ago, the writer 
fell victim to this error in the following way. While lecturing to a new class,
he noted repeatedly one student shaking his head from side to side. This, apparently
coming just at the end of an explanation, caused the writer to restate
his material, attempting to be more clear. With luck, there would be no
head-shaking, and the instructor would go on to another point, but time after 
time the head-shaking appeared, causing him to go back over material, at- 
tempting to remove whatever confusion was causing the student to disagree 
so constantly. After two sessions of this backing and filling, the head-shaker 
came to the instructor after class to explain that he was not disagreeing, but
was afflicted with an uncontrollable nervous tremor. Knowing this, the writer 
later noticed that the head-shaking was present many times when it could 
not be interpreted as a sign of disagreement. When we make an interpreta- 
tion, we tend to overlook facts which do not fit the interpretation, and may 
even invent facts needed to complete the event as we interpreted it. 

All observer errors are heightened when memory is permitted to operate.
We tend to remember things which fit our impressions and biases, and events
which are dramatic, but we forget many events that did not impress us
deeply. In most non-scientific work, opinions of people are based on memory 
rather than careful records. When a vacancy occurs the boss promotes the 
worker whom he remembers as showing the best qualities. Another worker, 
who performed equally well, may not be recalled as favorably. Errors of 
memory even distort reports of characteristics such as height and punctuality 
which can be observed objectively. 

5. A clinical psychologist asks a parent how well his six-year-old child gets along 
with other children. Illustrate how each of the following errors might operate.

a. The observer has not observed an adequate sample for judging typical behavior.

b. The observer notices events which fit his preconceived notions.

c. The observer is most likely to note the behaviors he considers significant, 
and to ignore others of equal importance. 

d. The observer may give a faulty interpretation to an event. 

REDUCING ERRORS OF OBSERVATION 
Time Sampling 

If observations are made only as opportunity suggests, there is a strong 
probability of faulty sampling. One of the best ways of obtaining data which 
permit precise comparison of different individuals is time sampling. In time
sampling, a set schedule of observations is planned in advance. Since it is 
usually not possible to observe every subject at the same time, the schedule is 
randomized so that each subject is seen under comparable conditions. In one 
study of social contacts of preschool children, for example, a schedule of one- 
minute observations was drawn up. After the observer watched a child for 
one minute, noting all social interaction, he wrote down a full record. Children
were watched in a predetermined order, which was altered from day
to day. During the study, each child was observed an equal number of times
during the first five minutes of the free-play hour, during the second five
minutes, and so on (5, pp. 509-525).

Fig. 83. Record obtained in a five-minute observation of a preschool child
(31, p. 43). The diagram shows the nursery school play yard and traces
Edward's movements. Letters mark the start and finish of each activity.

The advantage of short, well-distributed time samples is that the cumulative
picture is likely to be far more typical than an equal amount of evidence
obtained in a few longer observations. Moreover, errors of memory are
reduced, since the observer can make full notes during or just after observing.
Time samples are especially suitable for recording specific facts that can
be expressed numerically, such as the number of social contacts between
children. A slightly more elaborate record shows the complete activity pattern
during the observation period. The record shown in Figure 83 reports
the behavior of Edward during a five-minute period. It indicates what he 
did, for how long, and where. 
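
The time-sampling schedule described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not the procedure
used in the study cited: the function names, the block length of five one-minute
slots, and the rule of rotating the order by one whole block each day are all
assumptions introduced here. The rotation is one simple way to guarantee that,
over a full cycle of days, every child is observed equally often in each
five-minute block, while shuffling keeps the order from being predictable.

```python
import random
from collections import Counter


def build_schedule(children, block_len=5, seed=0):
    """Return, for each day, a list of blocks of one-minute observation slots.

    Hypothetical helper: the number of children must be a multiple of
    block_len, and the cycle lasts as many days as there are blocks.
    """
    if len(children) % block_len != 0:
        raise ValueError("number of children must be a multiple of block_len")
    rng = random.Random(seed)
    n_blocks = len(children) // block_len
    base = list(children)
    rng.shuffle(base)  # random starting order
    schedule = []
    for day in range(n_blocks):
        shift = day * block_len
        order = base[shift:] + base[:shift]  # rotate by one whole block per day
        blocks = []
        for b in range(n_blocks):
            block = order[b * block_len:(b + 1) * block_len]
            rng.shuffle(block)  # vary the order inside each block
            blocks.append(block)
        schedule.append(blocks)
    return schedule


def block_counts(schedule):
    """Count how often each child falls in each block across all days."""
    return Counter(
        (child, block_index)
        for day in schedule
        for block_index, block in enumerate(day)
        for child in block
    )


if __name__ == "__main__":
    children = [f"child_{i:02d}" for i in range(10)]
    schedule = build_schedule(children, block_len=5, seed=1)
    for day, blocks in enumerate(schedule, start=1):
        print(f"Day {day}: {blocks}")
    # Every (child, block) pair should occur the same number of times.
    print(set(block_counts(schedule).values()))
```

With ten children and five-minute blocks, the cycle lasts two days, and each
child is seen once in the first five minutes and once in the second. A longer
study would simply repeat the cycle with a new seed.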

6. Why might an unfair picture of a child’s behavior be obtained if he were 
always observed during the first five minutes of the play period and never 
during the second five minutes? 

7. What sorts of information about Edward’s personality could be obtained 
from a cumulation of records such as Figure 83? 

8. What sorts of information about Edward’s behavior are discarded in making an
objective record such as Figure 83? 

Photographic and Phonographic Recording 
Many of the errors in observing have been minimized by replacing the
human observer. Unlike a human observer, mechanical recording devices
can record several events occurring simultaneously, can watch several people 
at once, and can remain “on the job” continuously for as long as needed. A 
motion-picture camera sees virtually everything the eye can observe, and 
records it with no significant error. Sound recording under adequate acoustic
conditions can fix for later reference all the conversation and other sounds the 
observer would notice. Recording therefore eliminates need for an expert 
observer at the scene of action. 

Because the record notes, within limits, everything that happens, it is 
especially valuable in noting significant facts that an observer might miss. 
Events that seem trivial when they happen may become important in the
light of later events. Thus, an observer is often puzzled when a subject loses
his temper or behaves erratically for no apparent cause. A record permits
one to retrace the preceding events to detect probable causes.

Recording has limited use because it is expensive, but it will probably become
more feasible as technical improvements are made. Recent developments
have reduced costs and provided equipment that can be operated
easily. The motion picture especially is costly, and can rarely be used in typical
situations. Some recent studies have substituted still photographs taken
automatically at intervals such as fifteen seconds. While these films do not
show motion, they are adequate for most events. Records of this type for a
preschool group, for example, can be analyzed to determine which children 
play together and for how long. One of the greatest problems with recording 
devices is the difficulty of utilizing the records obtained. Even a fairly brief 
recording session produces a huge mass of film or sound recording which can
only be reduced to convenient form by careful and well-planned analysis. 
Any analysis must discard some of the information recorded, to center attention
on the most significant data.
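
As an illustration of how interval photographs might be reduced to convenient
form, the following Python sketch estimates how long each pair of children
played together. It rests on assumptions not found in the text: that an analyst
has already coded each photograph as a list of play groups, that the data
layout and the function name `time_together` are as shown, and that each
photograph stands for one full interval (fifteen seconds here). Everything
else in the frame, such as posture or activity, is deliberately discarded.

```python
from collections import Counter
from itertools import combinations


def time_together(frames, interval_seconds=15):
    """Estimate seconds each pair of children spent in the same play group.

    frames: list of coded photographs; each photograph is a list of groups,
    and each group is a collection of the names of children seen together.
    """
    frames_together = Counter()
    for groups in frames:
        for group in groups:
            # Every pair inside a group is credited with one photograph.
            for pair in combinations(sorted(group), 2):
                frames_together[pair] += 1
    return {pair: n * interval_seconds for pair, n in frames_together.items()}


if __name__ == "__main__":
    coded_frames = [
        [{"Ann", "Bob"}, {"Cy"}],
        [{"Ann", "Bob", "Cy"}],
        [{"Bob", "Cy"}, {"Ann"}],
    ]
    for pair, seconds in sorted(time_together(coded_frames).items()):
        print(pair, seconds)
```

The totals are only as fine-grained as the interval between photographs, so
brief contacts that begin and end between exposures are missed.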

Among the uses to which photography has been put are the following. 
Behavior of visitors to a museum exhibit has been recorded by a concealed 
camera taking photographs at frequent intervals (23). The records showed
the time spent at each exhibit, the order in which the exhibits were approached,
and other facts of value in exhibit planning. In aviation research, a
motion picture of the pilot’s head and shoulders, made by a camera mounted
in the cockpit, recorded his behavior during landing. Direction of glance was
studied as a means of improving landing technique (32). In gunnery training
and evaluation of combat performance, cameras were synchronized with gun
triggers used by aerial gunners (13). As the shot was fired, the camera made
a record of the point of aim. The film yielded an accuracy score and permitted
analysis of such causes of misses as inadequate lead.

Sound recording on disks has been expensive and usually required the 
constant services of a technician. Wire recorders and tape recorders have now 
been improved so that they are adequate for most observations. With a concealed
microphone, records one hour or more in duration can be made at
low cost and with little attention. Such records have been used to evaluate
and train counselors and classification interviewers (7, 25). If a human observer
were in the room, a counselor could probably not do typical work because
of his client’s consciousness of the observer. This factor is greatly reduced
by automatic recording, although even so the counselor may perform
differently because he knows of the microphone. Recordings of radio and 
battlephone communications have been used to study the performance of 
military personnel. The records permit analyzing clarity of speech and skill
with which tactical duties are performed (6).

In addition to direct records of behavior, a few devices have been tried for 
recording the results of a person’s actions. Thus, a “ride-recorder” was tried
as a basis of objectively recording pilot performance. As the pilot executed
standard maneuvers, the recording instrument noted changes in heading, tilt,
and altitude of the plane (33).

9. Discuss the advantages and disadvantages of still pictures taken in sequences 
about one second apart, as compared with a human observer for each of the 
following purposes.

a. A city engineer wishes to study the traffic patterns at an intersection to 
determine what difficulties cause traffic jams.

b. In a study of rat performance, it is important to note both time and specific
errors made in maze running. 

10. Discuss the advantages and disadvantages of sound recordings as compared 
with a human observer for each of the following purposes.

a. An investigator wishes to measure the skill with which different teachers 
handle discipline problems. 

b. A teacher of dramatics wishes to judge her students’ ability to read a 
dramatic selection. 

11. Sound recordings of group discussions are often used to study individual differences
in dominance, leadership, and other traits. What types of observable
information about personality could not be obtained from the sound record? 

Training of Observers 

Most people form their impressions of others haphazardly and inaccurately.
They are misled by prejudices, and they draw unwarranted conclusions from
single events. Observers should receive training, directed
particularly to making them self-critical and objective. Training for inter- 
viewers is a well-developed practice in industrial personnel work. Foremen 
are receiving training in many industries to assist them in judging the men 
under them. Teachers receive special training in observation of children. All 
such programs improve tlie quality of resulting data. 

DATA YIELDED BY OBSERVATIONS 
Objective Data 

An observer may record, free-style, what he sees, or he may record certain
specific items to which his attention is directed. Standardization of
observations is greatly increased if every observer notes a limited number of
specific, countable behaviors, as in the time-sampling method. Objectivity of
observations is obtained by defining carefully what is to be noted and
providing a check list, schedule, or recording form for uniform data.

The extent to which factory workers attend to their work may be described
by a time record which notes the exact moments when they are at work, and
the time spent in looking around, obtaining tools, and visiting. The causes of
distractions can also be noted. Such records for different workers and
departments can be analyzed both for judging the workers and for planning
rest periods or improved tool distribution. Morrison has used similar methods
for recording the study habits of high-school students (20). Such records as
shown in Figure 84 not only supply precise estimates of time spent in study
but provide a basis for discussion with the pupil of ways of improving his
efficiency and of recording improvement. Child development has been studied
through records of social contacts, play activities, speech, and other
objectively defined behaviors (5, pp. 509 ff.; 31). Such precise reports are
especially useful for measuring changes, since the observer’s memory is an
untrustworthy basis for comparing performance now with performance several
months earlier.

[Fig. 84. Concentration profile of eighth-grade girl during 20-minute study
period (20). The chart is not reproduced here. Its labels read, in order:
Idly turns pages of book; Plays with neighbor's hair; Smiles at boys; Looks at
some writing on board; Gazes about, distributing smiles impartially over the
entire room; Again becomes interested in her neighbor's hair; Looks at the
clock, smiles, and whispers to her neighbor; Puts book away and converses
with neighbor; Reads rapidly; Reads slowly; Reads about one page; Continues
to read, very much bored; Looking at pictures in the book; Reads with very
irregular eye movements; Reads with slow eye movement.]

Check lists and schedules are also used to systematize observation records.
The check list used in observing ability (see p. 296) generally lists the correct
actions and the significant errors to be noted. For observing typical behavior,
a more flexible system is often required. The check list for such work may
provide a series of categories for classifying acts. Quite different specific acts
may be recorded rapidly with considerable objectivity. This procedure is
illustrated in the studies of Anderson and his associates, who observed teachers
to determine their tendency to domination or “integration” in leadership. The
observer checked the frequency of incidents in 28 categories and sub-categories
such as the following (3):

DC     Domination with evidence of conflict
DC-3   Relocates child (e.g., changes his seat)
DC-5   Disapproval, blame or shame directed toward child as a person
DN     Domination with no evidence of conflict
IT     Integration with evidence of working together
IT-16  Approval of self-initiated behavior of child

Experienced observers, simultaneously recording over 600 incidents, agreed 
closely in their reports (3, pp. 43 ff.).

12. What advantages and disadvantages would a check list or schedule have for 
each of the following purposes, compared to a one-paragraph descriptive 
report? 

a. A social agency wishes its visitor to report the condition of homes of its 
clients, including furnishings, conveniences, and neatness. 

b. A department store sends shoppers to be served by its clerks and observe 
their procedure and manner. 

c. A state requires an observation of the applicant’s driving before issuing 
a license to drive. 

13. An investigator wishes to measure punctuality, for research purposes. He
stations himself where he can observe the arrival of each student attending 
a particular class. Number of minutes early or late is recorded for each person. 
Records are made on several days (10). What assumptions are involved in 
using the average of these records as an index of punctuality? 

If observers record their opinions merely as descriptive statements, it
becomes very difficult to compare one subject with another. In order to permit
routine processing of data or statistical analysis, the opinions are frequently
reduced to ratings on one or more traits. This method is so widely used, and
presents such a variety of problems, that it will be treated separately in a
later section.

Anecdotal Records 

In marked contrast to numerical records and ratings are anecdotal records.
Anecdotal records are an attempt to escape the bleakness and narrowness of
quantitative methods, to obtain a full and lifelike picture of the subject. The
observer is free to note any behavior that appears significant rather than
having to concentrate on the same traits for all subjects. Often the anecdotes
are reports of incidents noted by a teacher or supervisor in daily contacts,
rather than obtained by a special observer.

In an anecdotal record, the observer describes exactly what he observed,
avoiding mixing interpretation and fact. The record is made as soon after the
action as possible, to eliminate errors of recall. Cumulated over a period of
time, the descriptions of incidents provide a richer picture of behavior than
any other equally simple technique. The following are typical anecdotal
reports:

“Paul, after projecting the film for the class, took it back to the office (where I
happened to be) to rewind. He is not very skilled, and missed his timing, so that
much of the film cascaded onto the floor instead of going onto the takeup reel.
John came up just then and said something sarcastic about Paul’s clumsiness. Paul
gave no answer, but kept on at work with no change of manner and a stolid face.
Richard, who had been watching Paul, turned on John, told him to ‘shut up and
give Paul a chance,’ and muttered something about ‘some of these kids make me
sick.’ [Paul seems to suppress emotion; he certainly heard John’s very unpleasant
tone.]”

“Joan spent the entire science period wandering from group to group instead of
helping Rose as she was expected to. She interrupted many of the others, telling
them they were doing the work wrong. She asked a lot of [foolish] questions
(‘Does filter paper make certain things go through or just keep certain things
out?’) and was teased a good deal by the boys. By the time Rose was finished she
returned; Rose was quite angry, but they made up and Joan helped put things
away. But on her first trip to the storeroom she stayed to plate a gold ring with
mercury, while Rose made repeated trips with the equipment.”

The observer has two responsibilities for making useful anecdotal records:
he must select incidents worth reporting, and he must be objective. Two
types of incidents are especially helpful: those which are characteristic of the
person and those which are striking exceptions to his normal conduct. The
typical incidents are helpful because they provide a more individualized
picture of the typical behavior than the hackneyed trait names that would
otherwise be used: friendly, showing initiative, rude, and so on. Exceptional
actions are rarely reported in ratings and general impressions. But they too are
characteristic of the person. A single incident showing interest in the
company’s welfare from a man known as a troublemaker, or a sign of enthusiasm
for learning on the part of a boy who has been rebellious toward school may
be the key which will open a new and successful method of dealing with the
person.

Objective reporting of anecdotal material requires the same caution as any 
other type of scientific observation. The observer must weed out value judg- 
ments and interpretations, attempting rather to report the exact occurrences, 
including the significant preceding events and environmental conditions. Al- 
though the ideal anecdote would report “everything” about the incident, this 
can never be attained. The reporter selects for his record the facts he con- 
siders relevant. 

Single anecdotes tell little about a person. As anecdotes accumulate,
however, they begin to fill in a picture of his habits. If a particular response is
typical, it will recur. An effective method of determining personality
characteristics is to search through the anecdotes regarding a particular case to
detect repetitions. A summary based on these recurring patterns usually
requires confirmation by further directed observation.

14. Examine the first and second of the anecdotal reports given above, and
indicate types of data that may have been discarded by the observer.

15. What information does the second anecdote give that would not be conveyed 
by ratings on attentiveness, responsibility, cooperation, and sociability? 

RATING SCALES 

Rating scales serve not only as a means of summarizing direct observations,
but also as a device for obtaining descriptions of the subject from judges who
have been familiar with his typical behavior in the past. Ratings are
employed as criteria and as basic sources of data for many types of research, and
are used practically in selecting workers, choosing workers to be promoted,
admitting students to college, etc. Ratings are popular because they can be
treated numerically and are superficially uniform from judge to judge.
Descriptive statements by interviewers, observers, and acquaintances are hard
to compare because of variations in style of writing. It is impossible at present
to apply statistical methods to most qualitative reports of behavior. Rating
scales are commonly used to reduce impressions to manageable form. Rating
scales generally consist of a list of traits to be rated, the judge being asked to
indicate the extent to which each behavior is characteristic of the subject.

Table 66. Intercorrelations of Ratings Given 1100 Industrial Workers (11)

Trait                         1     2     3     4     5     6     7     8     9     10    11    12
 1. Safety                    .35a  .61   .52   .63   .55   .60   .49   .54   .62   .61   .55   .25
 2. Knowledge of job          .61   .46a  .81   .85   .79   .82   .78   .78   .80   .67   .67   .52
 3. Versatility               .52   .81   .47a  .80   .72   .80   .71   .78   .82   .68   .63   .50
 4. Accuracy                  .63   .85   .80   .45a  .81   .67   .80   .78   .84   .74   .70   .84
 5. Productivity              .55   .79   .72   .81   .46a  .86   .86   .80   .81   .81   .73   .45
 6. Over-all job performance  .60   .82   .80   .67   .86   .46a  .85   .83   .88   .80   .74   .60
 7. Industriousness           .49   .78   .71   .80   .86   .85   .47a  .82   .84   .80   .67   .53
 8. Initiative                .54   .78   .78   .78   .80   .83   .82   .48a  .86   .72   .72   .77
 9. Judgment                  .62   .80   .82   .84   .81   .88   .84   .86   .45a  .76   .75   .43
10. Cooperation               .61   .67   .68   .74   .81   .80   .80   .72   .76   .37a  .80   .52
11. Personality               .55   .67   .63   .70   .73   .74   .67   .72   .75   .80   .39a  .71
12. Health                    .25   .52   .50   .84   .45   .60   .53   .77   .43   .52   .71   .36a

Column numbers refer to the numbered traits in the rows.
a Italicized figures in the original (the diagonal entries, marked a here) show
correlations of two raters’ judgments on each worker, i.e., reliability of
judging.

All rating schemes are subject to severe observer errors. One error, the
halo effect, has received a great deal of attention. Observers form a general
opinion of a person, and the ratings they give him on specific traits are
influenced by this general feeling of favorableness or unfavorableness. Even
such an objectively measurable trait as productivity of a worker may be
rated erroneously, if a poor worker has a pleasing personality. Table 66 gives
convincing evidence of halo. This table shows the correlation of ratings given
1100 industrial employees. Ratings on quite dissimilar traits show a marked
general factor, apparently corresponding to the foreman’s opinion of the 
man’s industriousness and productivity. A factor analysis shows two other 
factors, skill and health. This scale might as well be reduced from twelve 
traits to two, a general factor and a skill factor. The health rating is too un- 
reliable to be depended upon ( 11 ). 

Generosity error is also common. Whenever possible, raters tend to give
subjects the benefit of the doubt. It is common to find 60 to 80 percent of an 
unselected group rated “above average” because of the urge to speak favor- 
ably if possible. 

Among the many rating procedures in existence, the descriptive graphic
rating scale seems to have particular usefulness. From the cumulated
experience of years of use of rating methods, certain principles for preparing or
criticizing scales have emerged: 

1. The judge should be asked to rate only those traits which are essential 
for the purpose involved; rarely more than five to seven. As more traits are
added, judges give less serious consideration to each, and rely more upon the
“halo.” 

2. The judge should be urged to indicate which traits he has inadequate
data for rating. Where a judge is directed to mark every trait, some ratings
are little better than guesses. Conrad directed judges to star ratings of traits
which they regarded especially important in the child’s personality.
Inter-judge correlations on all traits ranged from .67 to .82. But for the traits which
three judges agreed in starring, the correlations of ratings rose as high as .96
(9).

3. Each trait should be judged along a scale, rather than on an all-or-none
basis. The judge will rarely report that an acquaintance is discourteous or
unpunctual on a yes-or-no choice, but he may check such a statement as
“sometimes forgets courtesy” or “is less punctual than the average.” In general,
five- to seven-point scales seem to serve adequately. Under very favorable
conditions for rating, much finer subdivisions of the scale prove profitable (8).

4. Traits and scale positions should be described as unambiguously as
possible. Different words have different meanings for judges, but if ratings are
to be comparable they must be based on a uniform question. “Leadership”
may suggest to one judge conscious wielding of authority, crisp decisions, and
general dominance. A subject rated high by this judge would receive a lower
rating from a judge who stresses encouragement of subordinates, leading a
group to cooperative decisions, and subordinating the leader’s desires to the
decision of the group. Such words as “average” and “excellent,” used in
defining the scale of judgment, are even more indefinite in meaning. They should
be replaced by specific descriptions of behavior.

A good example of the descriptive rating scale is the American Council on
Education Personality Report, shown in Figure 85. This form is principally
used to obtain ratings of high-school graduates for use by college admissions
officers. Another widely known form is the Haggerty-Olson-Wickman
Behavior Rating Schedule. Questions are illustrated in Figure 86. This scale is
used for recording opinions about the behavior and adjustment of children.
Unlike purely descriptive scales, this is scored to yield a measure of behavior
difficulties. Weights were determined by identifying the frequency of each
rating in groups of children known to have behavior problems. Both of the
scales shown require rating exactly at one of five positions. A graphic scale
permits the judge to place checks anywhere along the scale.

Name of student.

A. How are you and others affected by his appearance and manner?
   □ Sought by others
   □ Well liked by others
   □ Liked by others
   □ Tolerated by others
   □ Avoided by others
   □ No opportunity to observe
   Please record here instances on which you base your judgment.

B. Does he need frequent prodding or does he go ahead without being told?
   □ Seeks and sets for himself additional tasks
   □ Completes suggested supplementary work
   □ Does ordinary assignments of his own accord
   □ Needs occasional prodding
   □ Needs much prodding in doing ordinary assignments
   □ No opportunity to observe
   Please record here instances on which you base your judgment.

C. Does he get others to do what he wishes?
   □ Displays marked ability to lead his fellows; makes things go
   □ Sometimes leads in important affairs
   □ Sometimes leads in minor affairs
   □ Lets others take lead
   □ Probably unable to lead his fellows
   □ No opportunity to observe
   Please record here instances on which you base your judgment.

D. How does he control his emotions?
   □ Unusual balance of responsiveness and control
   □ Well balanced
   □ Usually well balanced
   □ Tends to be unresponsive        □ Tends to be over emotional
   □ Unresponsive, apathetic         □ Too easily depressed, irritated or elated
   □ No opportunity to observe
   Please record here instances on which you base your judgment.

E. Has he a program with definite purposes in terms of which he distributes
   his time and energy?
   □ Engrossed in realizing well formulated objectives
   □ Directs energies effectively with fairly definite program
   □ Has vaguely formed objectives
   □ Aims just to “get by”
   □ Aimless trifler
   □ No opportunity to observe
   Please record here instances on which you base your judgment.

Fig. 85. The American Council on Education Personality Report, Form B.

Is he abstracted or wide awake?
   Continually absorbed in himself (5)
   Frequently becomes abstracted (4)
   Usually present-minded (2)
   Wide-awake (1)
   Keenly alive and alert (3)

Is he shy or bold in social relationships?
   Painfully self-conscious (4)
   Timid, frequently embarrassed (2)
   Self-conscious on occasions (1)
   Confident in himself (3)
   Bold, insensitive to social feelings (5)

How does he accept authority?
   [first scale position not legible in source] (5)
   Critical of authority (4)
   Ordinarily obedient (3)
   Respectful, complies by habit (1)
   Entirely resigned, accepts all authority (2)

Fig. 86. Items for the Haggerty-Olson-Wickman Behavior Rating Schedule.*


16. In the American Council rating scale, the trait scale for leadership (C) is
defined by five specific phrases. What advantage does this scale have over
a set of adjectives such as “Excellent,” “Good,” “Average,” “Poor,”
“Unsatisfactory”?

17. Why might keenly alive children have more behavior problems than those 
rated as wide-awake? Would “keenly alive and alert” ordinarily be considered
a sign of poor mental hygiene? 

18. The phrase “very favorable conditions for rating” was used in paragraph 3 
above. What type of conditions (apart from the scale itself) influence the
precision and validity of ratings? 

The forced-choice technique applied to rating scales is similar to the forced
choice found helpful in self-report tests. It asks the judge to choose which of
two equally favorable adjectives best applies to the subject. No matter what
halo or generosity effects enter the judge’s thinking, such a question forces
him to consider the subject in regard to the specific trait in question. After
deciding whether Jones is more calm than cautious, more friendly than
intelligent, more creative than painstaking, and so on, the judge has provided
the best picture his knowledge permits of the characteristics of the subject.
This does not, however, yield a score on general merit, since every subject
is marked on every pair of items. The device was used by the Army in scales
not released for publication (28). The forced-choice scale was supplemented
by a single general rating scale asking for the judges’ opinion on the over-all
merit of the person rated. Such scales were used experimentally to place men
in assignments. Point scores were obtained by weighting the adjectives in
each pair according to their correlation with success in a given job.

* Reprinted by special permission from Haggerty-Olson-Wickman Behavior Rating
Schedules. Copyright by World Book Company.

THE SOCIAL MATURITY SCALE

One tool of importance in evaluating normal behavior is the Vineland
Social Maturity Scale.† This is ordinarily used as an adjunct to the clinical
interview, rather than as a means of summarizing field observations. It is a
standardized list of behaviors which permits a quite objective comparison of the
subject’s behavior with a norm. It is an age scale, somewhat analogous to the
Binet test, listing behaviors characteristic of ages from birth to adulthood.
The scale seeks to measure degree of maturation of social competence, shown
in personal responsibility and social participation. Items from two of the age
levels are as follows:

“Age II-III
“Asks to go to toilet.
“Initiates own play activities.
“Removes coat or dress.
“Eats with fork.
“Gets drink unassisted.
“Dries own hands.
“Avoids simple hazards.
“Puts on coat or dress unassisted.
“Cuts with scissors.
“Relates experiences.

“Age X-XI
“Writes occasional short letters.
“Makes telephone calls.
“Does small remunerative work.
“Answers ads; purchases by mail.”

In using the scale, the examiner interviews an informant who is thoroughly
acquainted with the child’s performances, and marks the scale accordingly. A
total score is computed which is called a “social age,” corresponding to the
“mental age” of Binet-type scales. A social quotient (SQ) may be derived.
The tool is useful in research on social development under various kinds of
environment, and is helpful in clinical analysis at all ages.
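The passage does not spell out how the social quotient is formed. A minimal sketch, assuming it is computed like the ratio IQ (100 times social age divided by life age), is given below; the function name and the sample ages are hypothetical.

```python
def social_quotient(social_age_years: float, life_age_years: float) -> float:
    """Ratio-type quotient: 100 * social age / life (chronological) age.

    Assumption: the SQ parallels the Binet-type ratio IQ; the source text
    only says that an SQ "may be derived" from the social age.
    """
    if life_age_years <= 0:
        raise ValueError("life age must be positive")
    return 100.0 * social_age_years / life_age_years


# Hypothetical child: social age 8 years, life age 10 years.
print(social_quotient(8.0, 10.0))
```

On this assumption a child whose social age equals his life age receives an SQ of 100, and a child who lags behind his age group receives a value below 100.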

PROBLEMS OF RELIABILITY 

Observations by impartial observers are generally accepted as valid if
they can be made reliable. Observations are more often used as criteria than
as predictors, and empirical studies of their validity are uncommon. The
logical validity of observation data is easily accepted if errors of measurement
are eliminated. Tendency to quarrel is surely measured by observed
incidence of quarreling. Tendency to seek solitary activities is validly revealed
by a full record of how one spends his free time. The questions that may be
asked regarding observations in normal situations reduce to three: How close
is the agreement of different judges making the same observation? How great
is the correspondence between observations in “similar” situations? How long
must the person be observed to obtain a reliable measure?

† Publisher: Educ. Test Bureau.

Reliability of a Judge 

The reliability of an observer hinges on the method of controlling
observation and the method of reporting. Errors occur in observing through
inattention of the judge or his selection of things to observe. Errors in reporting are
at a minimum when he has made objective notes on specific facts observed.
Errors rise when he makes judgments, as in a rating scale.

Ratings casually obtained are notoriously unreliable, as judges are incon- 
sistent in repeated ratings and disagree with each other. Many prediction 
studies have been unable to obtain high validity coefficients even with good 
predictors, when ratings were used as criteria. The reliability coefficients for 
judgments based on general acquaintance are frequently .40 or .50, as is
illustrated in Table 66. Single unconfirmed ratings should never be depended
upon. Ratings may be made more reliable by combining the opinions of sev- 
eral judges. Symonds estimates that the correlations of average ratings can 
reach .90, when the average is based on eight judges (30). It is rarely pos- 
sible, however, to obtain so many judges who know the person well. 
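The arithmetic behind such an estimate can be sketched with the Spearman-Brown formula, which predicts the reliability of the average of k equally reliable judges from the reliability of a single judge. The sketch below takes .50, the upper figure quoted above for ratings from general acquaintance, as the single-judge value. Whether Symonds reached his estimate this way is an assumption, and the function name is illustrative only.

```python
def spearman_brown(r_single: float, k: float) -> float:
    """Predicted reliability of the average of k equally reliable judges.

    r_single is the correlation between two single judges; k is the number
    of judges whose ratings are averaged.
    """
    return k * r_single / (1 + (k - 1) * r_single)


# Reliability of averaged ratings as judges are added, starting from r = .50.
for judges in (1, 2, 4, 8):
    print(judges, round(spearman_brown(0.50, judges), 2))
```

With eight judges the predicted reliability is about .89, close to the .90 cited above; starting from .40 the same eight judges would reach only about .84.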

The reliability of rating is greatest for those behaviors which can be
objectively observed and which do not involve interaction with others. Traits
reliably rated include originality and energy. Reliability is least for general
attributes, especially in the field of character, and for traits dealing with
interpersonal relations. Traits which are unreliably rated include integrity,
cooperativeness, and kindness (14).

Although ratings based on incidental impressions are quite untrustworthy,
judgments can be very reliable when based on systematic observation. New- 
man reports reliability coefficients for combined ratings by three trained ob- 
servers ranging from .74 to .90 (22, p. 30). Similarly high correspondence was 
obtained by Conrad and by Anderson, referred to earlier. 

19. Why might integrity, cooperativeness, and kindness be especially hard to
rate reliably? 

20. Which of the following traits would probably be hardest to rate reliably after 
observations: talkativeness, freedom from tension, freedom from anxiety, 
leadership, and sensitivity to others’ opinions ( 14, p. 32) ? 

21. The Spearman-Brown formula used for estimating the reliability of lengthened 
tests (see p. 61) may also be applied to predict the reliability of combined 
judgments of several equally competent raters. Apply the formula to
determine the reliability to be expected if the ratings of health and initiative
(Table 66) had been based on the average of three reports, instead of only
one.

22. Ratings on leadership made at Officer Candidate School correlated only .15 
with ratings on efficiency of combat leadership by superior officers in the
field after the men had been observed in combat (15, p. 72). How can the 
smallness of the correlation be accounted for? 

Consistency of Performance 

There is no universal answer as to when observations in different situations
will give similar pictures of the individual. Studies of personality assume
that there is an underlying unity which causes a person’s behavior to be
consistent in similar situations. One must be cautious in assuming that situations
which look comparable are really the same for the subject. Situations which
seem to relate to the same trait may actually call forth very different
behaviors from the same person.

The best evidence on this problem is Newcomb’s observation of boys in a 
summer camp ( 21 ) . Daily records were made of many particular responses, 
such as cooperation in after-meal work, fighting with other boys, and per- 
sistence in morning activities. When these day-to-day records were studied, 
most boys were found to be inconsistent, in the sense that they acted in an 
extrovert fashion half the time, and in an introvert fashion on the other oc- 
casions. As Newcomb points out, the situations are only superficially alike. 
“Whether or not Johnny engages in a fight may depend on whether or not he
thinks he can ‘lick’ his opponent” (21, p. 39). The apparent inconsistency of 
Johnny’s action, from an observer’s frame of reference, may be highly con- 
sistent from Johnny’s point of view. Correlations were computed between ob- 
served behaviors which were supposed to represent single traits, such as 
showing off or dominance over peers. Behaviors grouped within one of these 
supposed traits correlated little higher than obviously dissimilar behaviors.
Another study found only trivial correlations (median .20) between punctu- 
ality observed in different situations. When scores were grouped in broad 
categories, some consistency between situations was found, but too little for 
refined prediction (9). 

It is clear that conclusions about a person formed in one situation, even on
the basis of many cumulated observations, are valid only for that situation.
Inference as to how he would act in another situation is warranted only when
responses under the two conditions have been shown to be correlated, or
when observation yields so much understanding of his underlying personality
structure (his frame of reference) that we can see what a new situation
means to him.

Sampling Required 

Symonds has commented emphatically on the need for adequate sampling
(30, p. 5): “A single observation is unreliable, a single rating is unreliable, a
single test is unreliable, a single measurement is unreliable, a single answer
to a question is unreliable. Reliability is achieved by keeping up observations,
ratings, tests, questions, measures; ... If you ask one teacher for her
judgment of a boy’s trustworthiness, you obtain what she has been able to observe
in those few narrow classroom situations that appeared when her attention
was particularly directed to some act involving honesty. An adequate rating,
on the other hand, requires the judgment of several raters in several
situations at several different times. Reliable evidence is multiplied evidence.”

The extreme variation in performances is suggested by the data in Table
67, based on observations of airplane landing. These data were gathered
during a test of ability, rather than observation under normal flying motivation.
If the test motivation has any effect, however, it should be to increase
consistency of performance by raising each man near to his maximum. The
observer rated each landing on place of landing, attitude of plane (three-point
or wheels first), and dropping or bouncing. Different observers agreed well
in making the records required. Retests, however, showed almost no
consistency of performance, and this consistency dropped even lower when tests
were made on different days. The investigators blame part of the erratic
changes on variations in wind, turbulence, and other external conditions. A
single measure of landing behavior, no matter how reliably judged, gives
little or no information of real value.

Table 67. Reliability of Observations of Airplane Landing (29)

                                                      N     Zone of   Landing    Dropped or
                                                            Landing   Attitude   Bounced
Correlation of observers viewing same landing         304   .83       .79        .88
Correlation between two landings on same day          340   .12       .32        .32
Correlation between two landings on different days    340   .02       .04        .00

How many observations are required to obtain reliable data depends on
the problem studied. The experimenter can measure reliability of sampling
by correlating ratings from “odd” with “even” observations. By this means, it
was determined that 24 or more five-minute time samples permitted
“reasonably stable” estimates of individual differences in preschool children, except
for a few children where unusual conditions altered behavior during some
observations (4). Such studies do not permit a general statement as to the
number of observations required. They do, however, show that several
observations of each person are needed, and that many short observations are
superior to a few longer samples of behavior.
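The odd-even check described above can be sketched as follows. Each person's observations are split into odd-numbered and even-numbered occasions, the two half-scores are correlated across persons, and, as a common follow-up step that the passage itself does not describe, the half-length correlation is stepped up with the Spearman-Brown formula to estimate the reliability of the full set. The scores and function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def odd_even_reliability(records):
    """records: one list of observation scores per person, in time order.

    Returns (correlation of odd with even halves, Spearman-Brown estimate
    for the full set of observations).
    """
    odd = [sum(r[0::2]) for r in records]   # 1st, 3rd, 5th ... observations
    even = [sum(r[1::2]) for r in records]  # 2nd, 4th, 6th ... observations
    r_half = pearson(odd, even)
    r_full = 2 * r_half / (1 + r_half)      # step-up for doubled length
    return r_half, r_full


# Hypothetical scores: five children, six time samples each.
scores = [
    [3, 4, 2, 5, 3, 4],
    [1, 0, 2, 1, 0, 1],
    [4, 5, 5, 4, 3, 5],
    [2, 2, 1, 3, 2, 1],
    [0, 1, 1, 0, 2, 1],
]
r_half, r_full = odd_even_reliability(scores)
print(round(r_half, 2), round(r_full, 2))
```

If the stepped-up value is too low for the purpose at hand, the remedy suggested by the passage is to gather more short observations of each person rather than a few longer ones.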

23. How many landings must a man make, on a given day, to yield a “zone of 
landing” score having a reliability of .70? (Apply Spearman-Brown formula.) 

24. As a criterion in selection research, would it be better to test every flier with 
repeated landings on the same day or with a similar number of landings
spread over several days? 

INCIDENTAL SOURCES OF DATA
Personal Records 

In addition to observation, students of personality have been able to rely
on evidence of typical behavior from incidental sources, such as personal
records and products. As a person goes through life, he leaves behind him
a trail of objective data which tell much about his personality and habits.
With effort, the investigator can often unearth such records and use them as
indices of typical behavior. These records are impartial, quantitative, and
usually highly valid measures of the behavior they reveal. In industry, time
cards punched routinely form an index of punctuality and absenteeism. Records
of the nurse provide information on frequency of accidents. Production
records often indicate both the quantity of output and the percentage of
output rejected by inspectors. Obviously such data are less biased than the
subjective reports of foremen. An example of the use of production data is the
work of Roethlisberger and Dickson (26). To test the effect of experimental
working conditions, each worker in the test group assembled relays as usual,
except for the changes in ventilation, motivation, etc. being studied. The
output of each worker, instead of going into a pool of finished parts, went into
an identified box. The number of rejects and the number of satisfactory relays
provided important data on the behavior of the workers (cf. p. 310).

Outside the job, the citizen tells about himself in his record of traffic
citations and accidents, of payment of bills, and of marriage and parenthood.
These data are rarely as useful as those gathered under uniform conditions
for all people, but they are at times important. Accident records, for example,
have been important criteria in studies of driving safety. School records are
rich in data: records of absences, of extracurricular participation, disciplinary
referrals, assignments completed, and so on. Such data also provide
information about the typical behavior of teachers. If an investigator finds that
Miss Franklin gives twice as many A’s as Mr. Park, and that Mr. Park refers twice
as many children to the principal, important hypotheses may be formed. The
validity of objective data can generally be accepted, but one must avoid over-



25. A traffic policeman writes 40 percent more tickets than normal for the force 
during a year’s work. What factors might make it impossible to draw con- 
clusions from this regarding his alertness or severity? 

26. On what traits might a man’s checkbook stubs provide data? 

Personal Products 

A man’s habits are revealed in the objects he leaves behind him. The man- 
uscripts of composers (and the music itself) reveal the temperaments of the 
artists, Beethoven’s stormy impulsiveness appearing in bold, irregular pages, 
Mozart’s delicate spontaneity appearing in neat, vivacious strokes (34). The
notches on the gun of the frontiersman report not only that he accounted for 
so many Indians, but also that he considered it important to record the num- 
ber. Examining the personal products of the subject provides evidence of 
characteristics from quality of handwriting to schizophrenia. Because these
data have usually been produced for purposes quite unrelated to the subject 
of investigation, they are generally good indicators of typical behavior. When 
produced under special test motivation, they indicate maximum ability in- 
stead. 

Graphology, the study of handwriting, is receiving increasing attention 
(2). It is one of the few techniques which can be used to study the past
personality of subjects, since samples of handwriting can usually be obtained for
the subject at all stages in his development. 

Personal products may be collected in school situations very easily. 
Themes, artistic productions, and answers to essay examinations are usually 
available. The wise teacher of grammar judges her success by studying the 
papers her pupils write for other teachers, where they are trying to express 
ideas rather than impress the English teacher. If observation is impracticable,
one can judge a teacher far better from her lesson plans than from her answer 
to a question about how one should teach. Techniques of work in shop and 
art are frequently judged by the product. Although this source of evidence 
has not received widespread use because of the scarcity of data, it is of po- 
tential value for evaluating habits of neatness, accuracy, problem solving, and 
many other educational objectives. Outside the school, relatively little use has
been made of products, although documents such as diaries, letters, and 
other papers are sometimes helpful to the psychologist (1). Procedures for 
improving the rating of products have been discussed in Chapter 12. 

The use of personal products for insight into personality is a relatively new
technique. It has become increasingly important as teachers, employers,
social workers, and therapists have sought to understand the motivations that
lie beneath symptomatic behavior. Virtually every act a man performs bears
the imprint of his unique personality. Wolff, in his experimental depth
psychology, has established significant relations between personality and
handwriting, gait and voice (34). Artistic productions express feelings and action
patterns; valid analysis has been made of products ranging from masterpieces
of famous artists to the casual doodling of the common man. As in the
projective techniques to be discussed in Chapter 20, it is important that the
products to be studied should have been the free responses of the subject. If
one is given a rigidly specified task to perform, he has little opportunity to
impress his personality upon its design or execution. Insofar as he is free to
choose what he wishes to produce, he gives the analyst more information.

The analysis of personal products follows closely the procedures and
theories used in projective techniques. Here it will suffice to mention that
study may be made of content, style, and method of attack. Interpretation for
a specific purpose may stop at superficial description of such traits as
neatness, or it may set up hypotheses based on psychoanalytic or other theories
of personality.

27. What assumption is made when an English teacher judges the typical gram- 
matical habits of her pupils from their papers in science and social studies 
classes? 

28. A talented seven-year-old taking a special art class has produced the fol- 
lowing set of spontaneous drawings over a period of several weeks. Quotation 
marks indicate his comments about them. 

Out West. “All the Indians are dead. The cowboys are raising the flags.
The cows are dead, too.” 

Man and King. “The last man wants to be king so he is kicking the king 
in the pants.” 

End of the Birds. Boat, with hunters shooting birds. 

The Fish and All. Fish, bubbles, turtles, etc. 

The Clown. “The clown is selling balloons and lollipops. He is hungry too.”

Gas Station. Gas station with man and cars.

Ford Station. Ford assembly plant with cars and ramps. 

Kottonton. Religious story. Shows little Kottonton chasing Satan, who calls 
for help. 

King. “Fat king has a rake in his hand to chop up everyone. But he is 
afraid of them.”

a. What ideas or themes seem to recur in these pictures? Which pictures 
seem to be most highly individual in content, unlike those of other seven- 
year-olds? 

b. What do these titles and comments suggest regarding the content of his 
mind? 


STUDIES OF REPUTATION 

Important data on reputation (how others view an individual) are
obtained by sociometric techniques. These methods are neither observations
nor self-reports. They resemble most closely the rating technique, but reports
are by the subject’s peers rather than his superiors. One of the simplest
sociometric methods is the “Guess Who” device. This was first used by
Hartshorne and May with school children. The test consists of descriptions of roles
played by children. Each child in the group responds to each description by
naming any child he thinks that description fits. How other children view
Walter is indicated by the description to which Walter’s name is repeatedly
matched. Typical descriptions are (12, p. 88):

Here is the class athlete. He (or she) can play baseball, basketball, tennis, can 
swim as well as any, and is a good sport. 

This one is always picking on others and annoying them. 
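
Scoring a “Guess Who” test amounts to counting how often each child is named for each description. The sketch below is a hypothetical illustration, not the original scoring procedure: the function name, the short labels for the descriptions, and every child's name other than Walter are invented, and the rule that a child's nomination of himself is ignored is an assumption.

```python
from collections import Counter, defaultdict


def tally_guess_who(responses):
    """Count nominations per child and description.

    responses: iterable of (rater, description, named_child) tuples.
    Returns {named_child: Counter({description: number of nominations})}.
    """
    tallies = defaultdict(Counter)
    for rater, description, child in responses:
        if rater != child:  # assumption: self-nominations are not counted
            tallies[child][description] += 1
    return tallies


# Hypothetical responses from a small class.
responses = [
    ("Ruth", "class athlete", "Walter"),
    ("Henry", "class athlete", "Walter"),
    ("Walter", "class athlete", "Walter"),   # ignored: names himself
    ("Ruth", "picks on others", "Henry"),
    ("Alice", "picks on others", "Walter"),
]

reputation = tally_guess_who(responses)
# The description most often matched to Walter indicates how the group sees him.
walter_top = reputation["Walter"].most_common(1)[0]
```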

No one assumes that reputations are true reports of personality. How one’s
associates view him is itself significant even when they misjudge him.
Reputations do in general correspond to behavior. Correlations, corrected for
unreliability, between reputations among adolescents and careful observations
of corresponding behaviors range from .45 to .70 (22, p. 39).
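
A correlation “corrected for unreliability” is ordinarily obtained with the standard correction for attenuation, in which the observed correlation is divided by the square root of the product of the two reliabilities. The sketch below assumes that formula; the source does not report the reliabilities involved, so the figures in the usage note are hypothetical.

```python
from math import sqrt


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the correlation between two measures if both were perfectly reliable.

    r_xy: observed correlation between the two measures
    r_xx, r_yy: reliability coefficients of the two measures
    """
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie between 0 and 1")
    return r_xy / sqrt(r_xx * r_yy)
```

With a hypothetical observed correlation of .36 and reliabilities of .81 and .64, the corrected value is .36 / (.9 × .8) = .50.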

The sociogram is a device for studying the social structure of groups.
Characteristics of an individual, including his popularity, may be studied by the
Guess Who method, but the sociogram gives further insight by identifying
cliques, hierarchies of leadership, and other social groupings. The sociogram
was developed by Moreno (19). Although the technique has been amended
in various ways which sacrifice effectiveness for convenience, the best
procedure is to request members of a group to indicate their choices for
companions in a particular activity. Jennings cites these illustrative directions to
be used by a group leader for obtaining sociometric data (16):

“Each of you knows best whom you would enjoy being with in the same group 
at our community center for the times you will be working and playing together. 
No one can know this as well as yourself. We shall be arranging our new schedule
for groups next Monday. As today is Friday, I can figure out the membership in
groups by Monday if you would like to choose associates today. We will keep the
same people we choose today for eight weeks, and then have a chance to choose
again. Keep in mind all the boys and girls you have come to know here whether
they are absent today or not. Let’s give three choices, or four if you like.
Wherever possible, I’ll arrange the groups so that the individual gets all his choices.
But it is very difficult to give all people all their choices because lots of people
might choose one person.” 

While the wording is varied, effective directions deal with a real group,
who make choices which will be followed up. The data are not obtained in a
test setting. Instead, they are obtained as a means of dealing with the group.
If data are obtained from a less real question, such as “Who are your friends?”
there is more likelihood of answers given to make a good impression.
Subjects must know that their reports will be treated confidentially. The
sociometric data should be used as promised, to set up work groups, committees,
homeroom seating, or whatever; this permits one to obtain cooperation when 
the technique is used again at a later date. 

When the choices are obtained, they are plotted in a sociogram. Figure 87
is the sociogram of a class of fourth-grade girls early in the school year.
Pupils indicated not more than three choices, and were also permitted to list
any whom they would not choose. This sociogram reveals several patterns
often encountered. There are two groups or cliques. In one Emily is the most
sought-after person, with Jane, Lenora, Caroline, Rhoda, and Louise as
accepted members. In the other group, Agnes is the key figure, with Lurline,
Patricia, and Ann as members. Patricia, one notes, is not thoroughly
integrated with the clique; while accepted by Agnes, she is also reaching toward
Emily in the other group, rather than Lurline or Ann. Agnes, who might be
a popular leader of all the girls, instead shows considerable hostility,
rejecting three popular girls. Ella is a fringer, not chosen by any of the others;
Tess is even more isolated.
Fig. 87. Sociogram for a class of fourth-grade girls (after 27, p. 297).
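
Before a sociogram is drawn, the choices are usually tallied. The following sketch shows one way such a tally might be made. It uses the choices and rejections listed for the tenth-graders in exercise 32, and the function names and the particular summaries (choices received, mutual choices, unchosen members) are illustrative assumptions, not a prescribed procedure.

```python
from collections import Counter

# Choices and rejections as listed in exercise 32 (Sam and Tom were absent).
choices = {
    "Shirley": ["Charles", "Jim", "Sam"],
    "Charles": ["Shirley", "Sam", "Jim"],
    "Phil": ["Jim", "Charles", "Shirley"],
    "Wallace": ["Phil", "Jack"],
    "Jim": ["Jack", "Sam", "Charles", "Shirley"],
    "Jack": ["Jim", "Tom"],
}
rejections = {
    "Charles": ["Tom"],
    "Phil": ["Wallace", "Tom"],
    "Wallace": ["Tom"],
    "Jim": ["Tom"],
    "Jack": ["Phil"],
}


def received(edges):
    """Count how many times each person is named in a {chooser: [named, ...]} map."""
    counts = Counter()
    for targets in edges.values():
        counts.update(targets)
    return counts


def mutual_pairs(edges):
    """Return the set of pairs in which each person names the other."""
    pairs = set()
    for a, targets in edges.items():
        for b in targets:
            if a in edges.get(b, []):
                pairs.add(frozenset((a, b)))
    return pairs


chosen = received(choices)
rejected = received(rejections)
members = set(choices) | set(chosen) | set(rejected)
unchosen = sorted(m for m in members if chosen[m] == 0)
mutual = mutual_pairs(choices)
```

In the drawn sociogram, persons with many choices received are placed near the center, mutual choices are joined by double-headed arrows, and unchosen members fall at the fringe.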


29. What children besides Tess and Ella are fringers? 

30. What interpretations of Agnes’ hostility can be suggested? 

31. Prior to this study, the teacher had characterized Tess as hard working,
interested in accomplishing tasks, “fits in nicely with the group.” Tess helps
others with their sewing, at which she is superior. How would the teacher’s 
outlook and treatment of Tess be affected by the information from the 
sociogram? 

32. Plot a sociogram using the following choices made in a group of tenth-graders. 
Discuss the interactions shown. 

Shirley chooses Charles, Jim, and Sam. 

Charles chooses Shirley, Sam, and Jim; rejects Tom. 

Phil chooses Jim, Charles, and Shirley; rejects Wallace and Tom. 

Wallace chooses Phil and Jack; rejects Tom.

Jim chooses Jack, Sam, Charles, and Shirley; rejects Tom. 

Jack chooses Jim and Tom; rejects Phil. 

Shirley is chosen by several girls whom she does not mention. Sam and
Tom were absent. 

33. How does the sociometric method differ from the common secret popularity 
ballot? 


For example, if sorority girls are asked to indicate their choice of roommates, and
their choice of persons with whom to study, the sociometric patterns will
differ. A best friend may be thought of as too noisy or untidy for a good
roommate, and a girl quite unpopular in these respects may be regarded as
an excellent helper on school assignments. Basic social configurations are
fairly stable when different questions are used, but one cannot assume that
the interpersonal structure of a group is the same under all conditions. The
sociometric structure also changes with time. By December, Agnes was a
“star” in her group, along with Rhoda and Emily. The cliques had
disappeared, thanks to the skill of the teacher. Ann and Lurline still chose each
other, but Agnes now turned her back on them, ignoring Ann and rejecting
Lurline.

Social structure may also be studied by observing contacts and rejections,
rather than reports of peers. This is especially helpful at the preschool level
where verbal choices are difficult to obtain (18).

Sociograms have been used practically in a variety of ways. Roethlisberger
and Dickson applied the technique to organize factory workers into more
congenial and effective groups. The AAF formed teams by sociometric
methods to operate airplanes. Teachers have found the sociogram an important
method of understanding relations among pupils, so that the school can
promote activities which will broaden contacts between children, develop
leadership, and break down cliques and isolation (25). Moreno used the method
to assign girls in small living groups in an institution for girls committed by
the Children’s Courts (19). In the OSS, sociometric methods were used in
assessment. Groups of candidates undergoing study lived and worked together
for three days. After that close association, each man rated and described
his fellows. These data gave evidence of ability to inspire confidence and
friendship which would be crucial in many of the assignments to be given
accepted men (24).

34. When sociograms were made of squadrons of Navy fliers on combat duty,
it was found that the “administratively designated leaders” were often not the
“natural leaders” in the group, i.e., the ones most chosen as work leaders (17,
p. 133). What practical suggestions follow from this finding?

35. In a group of sorority girls, the sociometric question, “Whom would you choose
as a roommate?” is asked; will results be the same if the question is changed
to, “With whom would you choose to go on a double date?”

CHAPTER 19


Observation in Test Situations 

GENERAL CHARACTERISTICS OF SITUATIONAL TESTS 
Comparison with Other Methods 

The assessment of typical behavior by self-report tests and by field
observations has not been entirely satisfactory. Most self-report tests must assume,
however dubiously, that the subject describes himself honestly. The strictly
empirical self-report test escapes this limitation, but even here variation in
frankness introduces a variable error of measurement. Moreover, the
empirical test does not describe the subject, and may be used only for the
limited purpose for which it has been validated.

Observation in normal situations escapes the errors of self-report only by
introducing marked observer errors. Field observation is frequently
impractical because of the large amount of observing required for reliability, and
has the great disadvantage that it is impossible to compare subjects on traits 
not normally evidenced in their daily activities. If, for example, an investi- 
gator intends to study individual differences in behavior after long periods of 
wakefulness, he can gather little evidence by field observation of workers or 
students. Only by setting up an artificial situation in which each person is 
kept awake for a long time can he observe how the person reacts to such 
fatigue. 

Observation during test situations has advantages over both the other
methods. This technique has been illustrated frequently in earlier chapters,
since one of the major advantages of any individual test is the opportunity it
affords for studying individual modes of reaction. Whereas, with the Binet
and similar tests, observation is incidental, the “situational tests” discussed in
this chapter are designed primarily to provide opportunity for observation.
The significant features of a situational test are as follows:

1. The person is subjected to a stimulus situation which is made as nearly
uniform as possible for all subjects.

2. The situation is designed to permit variation in those types of behavior
which the tester wishes to observe.

3. The test is so designed that the subject believes one characteristic is
being tested, while the observer is actually observing some other aspect of
performance.

4. The observer, or a recording device, makes careful records of the
subject’s method of performance, rather than noting only the amount performed.

Not every situational test utilizes all these features. 

An illustration of the technique is the Observational Stress Test of the
AAF, used to assess men entering pilot training on their ability to resist
distraction and confusion under pressure. An apparatus test is administered in
such a way that each candidate is subjected to standardized stress-producing
stimuli. The apparatus consists of a set of seven controls (foot pedal, throttle,
stick, and various levers) which the examinee has to reset continually as
signal lights change and buzzers sound. The total time required to react to
each signal is recorded electrically. The examinee is told that he will be
observed by a concealed observer “just as a check-pilot will rate you in flying.”
The administration of the test is standardized: one minute of rest and
anticipation, one minute of directions regarding signals and controls, and three
short test periods. In each period, the examinee is given a set of increasingly
reproving “stress directions” while he is busily moving levers. The patterns of
signals are made more complex. In test period C, the pattern of lights changes
six times, after intervals of 15, 20, 15, 15, 10, and 40 seconds, while the
examiner is delivering the following speech in an urgent manner: “Don’t make
lights flicker on and off. Be steady. . . . Quit making errors. You aren’t
moving fast enough. . . . More speed. . . . Hurry and stop the clock. . . . Last
chance. . . . Set controls quickly. . . . You are still making errors.” The
concealed observer, meanwhile, makes extensive ratings of behavior,
considering general manner (poised, blocked, confident, etc.), reactions to
criticism (obedient, ignores it, annoyed, etc.), and other characteristics. In
addition, a thumbnail sketch of behavior is written, and objective clock scores
showing performance for the successive periods are recorded. A final clinical
rating, arrived at from all these data, correlated .35 with success in pilot
training (cf. Table 49). Evidence is incomplete whether the clock scores, an
objective measure of performance under stress, are a better or poorer predictor
than the clinical assessment (5, pp. 660-663).

The greatest advantage of the test observation is that it makes possible the
observation of characteristics which appear only infrequently in normal
activities: characteristics such as bravery, reaction to frustration, and
dishonesty. A single situational test may reveal more about such a trait than weeks
of field observation. Second, the subject’s desire to make a good impression
does not invalidate the test. In fact, just because he is anxious to make a good
impression, he reveals more about his personality than would normally
appear. It is necessary, however, to take this motivation into account in
interpreting results. The third advantage of the situational test is that it comes
closer than other techniques to a standardized measure of typical behavior.
Some tests place great reliance on the objective results, whereas others are
used more as a basis for clinical and impressionistic evaluation. In either case,
the test observation gives a better basis for comparing individuals than
observations in which different subjects are seen in different situations. Errors
of observation and sampling are still present, however.

Uses of Situational Tests 

The principal uses of situational tests have been for research in character,
and similar characteristics that are difficult to observe. Practical
use has been limited because such tests are difficult to design and administer,
but they have been found helpful in several types of military assessment.
German military psychologists emphasized situational tests heavily, and such
tests were adopted by the American Office of Strategic Services to select
personnel for military intelligence duties. Use in other branches of the service
was slight. Clinicians have adopted many tests primarily because of the
observations they permit; such tests are especially helpful in studying thinking
habits and reaction to emotion-producing situations.

Structured and Unstructured Situations 

The principal difference between observational tests and tests for other
purposes is that the former are more unstructured. This distinction is
especially important in connection with projective tests (Chap. 20). Structuration
is a helpful concept, if not taken too literally; most situations are somewhat
unstructured, and no rigid classification can be made. A situation is
structured if it has for all subjects a definite meaning. An unstructured situation
presents so few cues to the subject that he can give it almost any meaning he
wishes. “No Admittance” on a door is structured, if one knows English; the
phrase suggests a definite response. If the same words were written in
Chinese they would give the person almost no cue to guide his actions. A
familiar unstructured stimulus is the strange sound in the night. Is it the wind?
a burglar? the cat? water dripping? The interpretation we make is strongly
influenced by our interests, by fears conscious and unconscious, and, of
course, by learned background. In a structured situation, the subject knows
just exactly what he is expected to do and how he is expected to do it. In the
unstructured situation, we see how he behaves when he is permitted, even
forced, to guide himself. The more ambiguous the situation, the more
opportunity the subject has to reveal his own individual method of interpretation
and performance. An example of the extremely unstructured situation is
Waehner’s procedure for studying personality. She observed her subjects’
behavior and products after turning each one loose in a studio equipped with
all types of art media and materials, with little more instruction than “You
may do anything you like with these” (20). The stress test for pilots was more
structured, since the man knew much of what he was to do; but he was left
free to show confusion or emotion, since he did not know that these were
being considered in evaluating him.

Highly structured situations are excellent as measures of ability, just
because they force everyone to perform the same task. When a test of ability
permits variation in work methods, observations become more informative
but scores become less valuable, since subjects’ scores change under different
work methods.

Projective situations, at the opposite extreme from tests of ability, use
almost totally unstructured stimuli. The projective test is so named because it
permits the subject to project into the situation his own behavior tendencies, 
and to reveal his unconscious thoughts, wishes, and fears (3). Thus the 
householder who interprets the creak in the dark as a burglar may be reveal- 
ing that he is more anxious than another man who interprets the same stim- 
ulus as a natural phenomenon and goes back to sleep. 

1. In each of the following situations, discuss whether it would be preferable to 
employ observations in natural conditions, or standardized observations where
conditions are fixed in advance, identical for all subjects. 

a. The telephone company wishes to rate its operators on courtesy and clarity of 
speech. It is able to tap conversations and make recordings. 

b. It is desired to screen Navy personnel for tendency to panic under conditions 
of extreme noise, as in amphibious landings. 

c. An investigator wishes to study the habitual recklessness of seven-year-old 
boys in climbing and jumping.

2. Discuss to what extent each of the following may be considered an unstructured 
stimulus situation. 

a. A teacher, during a test, glances up from her desk and barely observes a 
hasty movement of one boy who is pulling his hand into his lap from the 
aisle. 

b. A group of college students are appointed by the instructor as a committee 
to prepare a panel discussion on compulsory military training and present 
it to the class. 

c. A group of people play duplicate bridge, the same set of hands being played 
at each table. 

d. A questionnaire is designed to obtain information about age, income,
education, etc. All possible answers are anticipated and presented on the blank in
multiple-choice form. 

3. Discuss to what extent each of the following is unstructured. If the situation is at 
all unstructured, discuss whether that characteristic is an advantage or a dis- 
advantage. 

a. Stanford-Binet test, Memory for sentences. 

b. Stanford-Binet test, Picture interpretation (Fig. 20). 

c. A test of addition which presents in random order the combinations up to 
9 + 9, the pupil being directed to do as many items as he can in the time
allowed. 

d. In the Porteus test (Fig. 1), the subject is to solve a maze. The time it
takes him to trace the correct path with his pencil is scored. 

PROCEDURES STRESSING QUANTITATIVE MEASUREMENT

Numerous tests investigating typical behavior have been strictly objective
in conception, yielding a numerical score representing the behavior in
question. These may be contrasted with tests leading to qualitative evaluation of
personality, which are discussed separately.

Tests of Character

The most widely known objective measures of typical performance are the
tests designed for the Character Education Inquiry of Hartshorne and May
(7, 8, 9). The tests in this series represent the only extended effort to
evaluate personality by strictly quantitative and objective means. The measures

CIRCLES PUZZLE 
First Trial 

Wait for the signal for each trial. Put the point of your pencil 
on the cross at the foot of the oval. Then when the signal is given 
shut your eyes and put a small cross or X in each circle, taking
them in order. 



Hold the page HERE with your finger tips 


Fig. 88. An “improbable achievement” test of character (7, p. 62).


were designed for an investigation of the nature and development of character,
and have been more important as a research method than in practical


Character traits are those in which society judges one type of response as
more ethical than another. In view of the sanctions surrounding such
behavior, different individuals can be validly compared only by subjecting them
to equal temptation in a situation where they believe they can violate
standards without detection. Corey’s measure of cheating (p. 377) is an example of
such a technique, adapted from the C.E.I. procedures. Other traits studied
by Hartshorne and May include truthfulness, honesty with money,
persistence, cooperativeness, and generosity.

Honesty with money was tested in children by presenting arithmetic prob- 
lems in which each pupil manipulated a set of coins to obtain the answers. 
One box of coins, secretly identified, was provided for each pupil, and at the 
end of the work each pupil placed his own box in a pile in front of the room. 
Since pupils were unaware that boxes could be identified, many took ad- 
vantage of the opportunity to keep some of the money. 

Honesty in a situation involving prestige required the children to do an 
impossible task, such as placing marks in small circles while keeping their 
eyes closed (Fig. 88). Many children, feeling it vital that they do well, turned
in papers showing many successes which could only have been obtained 
through cheating. 

Generosity was studied by providing each child an attractive, well-filled 
pencil box, with the understanding that he could keep all the items, or could 
inconspicuously donate any of them to a collection for less privileged chil- 
dren. 

4. Harry is much more generous than Fred on the pencil-box test, giving up nearly
all the objects. Why cannot one conclude that in other situations Harry would 
also be more generous? 

5. If Bill does better than Fred on the circle-dotting test of honesty, what conclu- 
sions can be drawn about Bill’s character? 

6. Joe is a known delinquent, having gotten into trouble together with a gang of 
boys for several minor thefts and disturbances. How can you explain the fact
that he does well on all the tests of honesty, cooperation, and generosity? 

Measures of Persistence and Endurance 

Strength of motivation has been of particular interest because it is thought
of as the missing link between aptitude and achievement. The employer, the
teacher, the clinician, and all other users of tests would welcome a technique
which predicted how much a person’s behavior would bear out the promise
shown in tests of ability. Although tests of motivation have not been adequate 
to meet this need, a few tests have explored possible techniques. 

Many measures of persistence have been devised for use in laboratory
investigations. The usual technique is to establish a situation in which the
subject is permitted to work as long as he will. If, after he has solved one or two
puzzles, a subject is presented one which is difficult or impossible, the time
he spends at work on it may be taken as an index of persistence. The
limitation of such a measure is that its meaning is unclear. Does a low “persistence”
score indicate indolence or realistic recognition of one’s own limitations?
Does a high score indicate praiseworthy industry, slavish obedience,
or unintelligent perseveration?

A similar method is the story test of Hartshorne and May (8, pp. 292 ff.).
The pupils read a story which builds to a climax: “Again the terrible piercing
shriek of the whistle screamed at them. Charles could see the frightened face
of the engineer. . . .” Here the examiner directs them that to learn the
ending they must read the difficult printed material that follows:

CHARLESLIFTEDLUCILLETOHISBACK“PUTYOURARMSTIGHT 

AROUNDMYNECKANDHOLDON 


NoWhoWTogETBaCkoNthETREStle.HoWTOBRingTHaTErrIflEDBURDeN 

OFACHiLDuPtO 


fiN ALly tAp-taPC AME ARHYTH Month e BriD GeruNNing fee Tfee Tconi 
ING‘ 

The pupil separated each word with a vertical mark as he deciphered it; the 
amount deciphered became an index of persistent effort. 

The Downey Will-Temperament Test² was an early and ambitious attempt
to test personality by a series of tasks related to handwriting. The test in
its present form has been found unreliable and quite inadequate to reveal
the basic temperamental qualities of the individual. For less ambitious
purposes, some of the techniques may be useful. The “motor inhibition” test
is one related to voluntary effort; the subject is to write, as slowly as he
possibly can, the phrase “United States of America.” Some subjects, highly
motivated, can stretch out this simple act to three-quarters of an hour,
without interrupting their writing.

7. The test illustrated in Figure 89 is a test of self-control, requiring the pupil to 
work in the presence of attractive distractions. Does this measure the same moti- 
vational factors as the story-completion test of persistence? 

A different technique for studying motivation is the test of resistance to
pain. In tests of this type, subjects are subjected to unpleasant stimuli, such as
a gradually increasing electrical stimulation (10). The subject is told that
his ability to endure pain is being tested, and that he can have the current
cut off whenever it becomes unbearable. This becomes a measure of how
much a subject wants to do well, since the pain in such studies is kept
below the level where serious physical damage would result. The test is
unstructured only in that the subject has no way of knowing how the pain he
is experiencing compares to that other men were willing (“able”) to tolerate.
He therefore is faced with the problem of interpreting for himself whether he
has performed respectably, and can ask to have the pain ended without
seeming cowardly. Measures of this sort were tried for pilot selection, but
proved too unreliable to use.

¹ From Hartshorne and May, Studies in deceit. Copyright, 1928, by The Macmillan
Company and used with their permission.

² Publisher: World Book Company

Fig. 89. A test of self-control, or resistance to distraction (8, p. 308).

Frustrability

Quantitative measurement of reaction to frustration has been less widely
applied than qualitative assessment under frustrating conditions.
Nevertheless, a few objective, quantitative methods have been tried (21). The
common procedure is to measure the subject’s ability to perform some type of
task such as cancellation or addition under normal test motivation. He is
then exposed to a standardized stimulus expected to induce frustration.
Unsolvable problems may be presented, or a test may be given on which the
subject expects and hopes to do well, such as an intelligence test, and after
the paper has been scored, the tester tells him that his score was low. A
second test is then given, similar to the one first given before frustration. A
marked rise or decline in score from the original test is taken to indicate
response to frustration.
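
The before-and-after comparison can be illustrated with a brief computational sketch. The scores, the function names, and the ten-point cutoff for a "marked" change are hypothetical and chosen only for illustration; no particular cutoff is given in the studies cited.

```python
def frustration_shift(before, after):
    """Return each subject's change in score (after minus before)."""
    if len(before) != len(after):
        raise ValueError("need one before and one after score per subject")
    return [a - b for b, a in zip(before, after)]


def label_changes(changes, threshold):
    """Label each change 'rise', 'decline', or 'stable'.

    `threshold` is an assumed cutoff for a "marked" change.
    """
    labels = []
    for change in changes:
        if change >= threshold:
            labels.append("rise")
        elif change <= -threshold:
            labels.append("decline")
        else:
            labels.append("stable")
    return labels


# Hypothetical cancellation-test scores for four subjects.
before = [42, 38, 51, 47]   # first test, normal motivation
after = [45, 25, 50, 60]    # similar test, given after the frustrating stimulus

changes = frustration_shift(before, after)
labels = label_changes(changes, threshold=10)
print(list(zip(changes, labels)))
```

A rise and a decline are both treated as responses to frustration, which is why the sketch labels them separately rather than averaging them away.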

General Findings from Quantitative Tests 
Evidence from many studies with all the above techniques seems to
confirm four generalizations.

1. Quantitative situation tests can be highly reliable, as measured by
correlations between trials in the same series.

2. Irrelevant ability factors affect test scores and reduce validity unless
partialed out.

3. Tests presumed to be equivalent, in the sense that they seek to measure
the same trait, frequently have low or negligible correlations.

4. There is no possibility of assuming, without direct validation, that a test
measuring trait X will predict behavior in a different situation where
trait X is considered important.
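
The second generalization refers to partialing out an irrelevant ability factor. A minimal sketch of the standard first-order partial correlation follows; the three input correlations are hypothetical and are not taken from any study cited here.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y with z held constant (first-order partial r)."""
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denominator == 0:
        raise ValueError("x or y is perfectly correlated with z")
    return (r_xy - r_xz * r_yz) / denominator


# Hypothetical values: two persistence tests correlate .50 with each other,
# and each correlates .60 with an ability measure z.
r_partial = partial_correlation(r_xy=0.50, r_xz=0.60, r_yz=0.60)
print(round(r_partial, 3))
```

With these assumed figures, much of the apparent agreement between the two tests is traceable to the shared ability factor rather than to the trait itself.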

The Hartshorne-May studies with character were particularly illuminating, 
because it had been assumed until that time that character was a unit, 
something predisposing to generally good or bad behavior, and that
generalized “character training” would be effective. They found that character
traits could indeed be dependably measured: correlations between deception
tests of the same type were .62 to .87, when the deception scores were
corrected for differences in ability. Moreover, the estimates were stable: tests
of cheating showed self-correlations of .75 to .79 when retests were given
six months apart (7, II, pp. 88-89). Jones has added the interesting finding
that when adults are tested for honesty, their scores correlate .37 with the
scores they had earned as adolescents (12). On the other hand, Hartshorne
and May found that these excellent measures of deception did not correlate
with each other; or, more precisely, the correlations between tests permitting
cheating with different sorts of material were too low (.40 to .00) to warrant
the conclusion that the various tests measured the same trait. The correla- 
tion between cheating on a classroom test and on a coordination test (Fig. 
88) was only .50 even after correction for all errors of unreliability. These 
data exploded the notion that honesty was a general trait which would 
be measured by any situation permitting dishonesty. Further correlations, 
between different character tests, were so low as to prove untenable the 
notion of a generalized character accounting for all desirable traits. Inter- 
correlations of honesty, persistence, cooperation, and so on were positive, 
but averaged only .24. This is evidence that the “general factor” in character
is of negligible size and has little relation to specific behaviors (9).
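
The phrase “correction for all errors of unreliability” refers to what is usually called the correction for attenuation. The sketch below uses the standard Spearman formula; the observed correlation and the two reliabilities are hypothetical and are not the Hartshorne-May figures.

```python
import math


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the correlation between true scores on x and y.

    r_xy is the observed correlation; r_xx and r_yy are the reliabilities
    of the two tests.
    """
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)


# Hypothetical values: observed r of .30 between two cheating tests
# whose reliabilities are .75 and .80.
r_true = correct_for_attenuation(r_xy=0.30, r_xx=0.75, r_yy=0.80)
print(round(r_true, 3))
```

If a corrected coefficient remains well below 1.00, the two tests cannot be measuring the same trait, whatever their names suggest.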

Similar evidence was obtained by Thornton with tests alleged to measure 
persistence (19). He found that tests supposed to relate to persistence 
actually measured three different personality factors. One factor was
willingness to endure discomfort or pain, found in the shock test and in another
test requiring one to grip a dynamometer as long and as firmly as possible.
An entirely separate factor was found in tests of what he called “plodding,”
the keeping on at a task when no special immediate discomfort was
involved. Tests showing this factor included the story test of Hartshorne and
May and Downey’s test of enforced slowness in writing. Interestingly,
three self-report questionnaires alleged to obtain descriptions of
persistence showed negligible relations with the actual persistence and
endurance factors. Instead, they seemed to measure an entirely different factor
of their own, presumably “tendency to claim to be persistent.” Thornton’s
study supports the belief that there is a general tendency to persistence
(“plodding”), but that some tests supposed to measure the factor actually do
not. Moreover, in any single test of persistence, there is a large unique factor.

It is not difficult to understand why different tests of a trait like honesty 
have low correlations. Some people need to do well on school tasks, be- 
cause that is an area where their aspiration is high and where they expect 
approval. The same persons may find no reason to cheat on a coordination 
test, which does not “mean anything to them.” Even within the same stimu- 
lus-type, generalization is difficult. If we leave a dollar bill on a desk, Jones 
may take it, and Smith may not. Jones succumbed to that particular tempta- 
tion more readily, but we may not conclude that he is less honest regarding 
money. Perhaps he needed money more, so that the temptation was not 
psychologically equal. If the amount had been twenty dollars, perhaps Jones 
would have left it on the table because of a realization that this would be a 
serious loss to the owner of the money, whereas Smith would have given 
in to the larger temptation. If it is impossible to generalize even from one 
controlled test to another of a similar type, it is clearly impossible to general- 
ize that a pupil honest in tests will show good character in adult business 
dealings, or that a pupil caught in a petty theft will be untrustworthy in
other matters. Similar cautions extend to all measures of generalized
personality traits.

The experimenter who wishes to employ measures of specific traits
encounters difficulties when he wishes to treat scores numerically. In a
measure of cheating, the person who earns a high score without cheating will
have little motive to cheat. In persistence tests, the measure is at best a 
composite of motivation and ability. Thornton found that physical strength 
was a major factor permitting some subjects to earn higher scores than others 
in motor and endurance tasks. Logical analysis is quite important in
interpreting tests supposed to measure traits.

It should not be concluded that tests of specific traits have no place.
They are invaluable for many research purposes, and the findings of such
research may have practical significance. Maller is authority for the statement
that the Hartshorne-May findings on the nature and development of char- 
acter led one national agency working with youth to revise its program 
completely. One stimulus to the revision was the finding that those who 
had received most recognition in the agency’s character-building activities 
were on the average most likely to cheat (11, p. 201). This again is not 
hard to explain when we consider that striving for recognition in a
competitive program, and need to obtain high scores even in a puzzle test, may
stem from the same basic insecurity and feeling of personal inadequacy.

Symonds, in 1931, concluded that performance tests of conduct are so 
specific that they permit only four possible actions (15, pp. 352-354):

1. They may be discarded as too specific to be practically useful. 

2. Tests may be devised to correspond exactly to the situation where they
will be used. 

3. An estimate of a general trait, such as honesty, may be based on a large
number of tests which are specific but have some intercorrelation. 

4. Performance tests may be used empirically to predict any behavior with 
which they correlate. 
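
The third of these actions rests on the fact that a composite of many weakly intercorrelated tests can be far more dependable than any one of them. A rough sketch using the Spearman-Brown formula follows. It assumes the tests are approximately parallel, which is a simplifying assumption made here and not a claim of the text, and the average intercorrelation of .20 is hypothetical.

```python
def composite_reliability(k, mean_r):
    """Spearman-Brown estimate for a composite of k roughly parallel tests.

    mean_r is the average intercorrelation among the single tests.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return (k * mean_r) / (1 + (k - 1) * mean_r)


# Hypothetical average intercorrelation of .20 among specific honesty tests.
for k in (1, 3, 9, 20):
    print(k, round(composite_reliability(k, mean_r=0.20), 3))
```

Under these assumptions a single test is nearly useless as a measure of the general trait, while a battery of nine or more begins to yield a usable estimate.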

8. Analyze the story-completion test of persistence, identifying all the factors
which might cause one fifth-grader to earn a higher score than another. 

9. Children from homes with low socio-economic status cheated more on
achievement tests than other children (r = .49) (7). What factors must be taken into
account before concluding that these children are more apt to violate standards
of good conduct? 

PROCEDURES LEADING TO QUALITATIVE EVALUATION 

The Concept of the Indivisible Personality 
In the field of personality, the tests just discussed are analogous to the
measures of ability used by Binet’s predecessors. The first stage in
psychological measurement, it will be recalled, was to try to isolate specific abilities
which could be precisely estimated, in the hope that a combination of these
would reveal the significant features of man. In the trait approach to
personality, investigators have sought to find a limited number of elements which
could be used to map out and predict typical behavior. Except for limited
types of prediction and research this approach has been found futile,
because behavior seems to draw upon an almost limitless number of specific
traits. Behavior in any given situation is determined by the combination of
features in that situation, not by general habits. Even Guthrie, sympathetic 
to the conception of personality as habit, comments, “The search for universal 
traits, or traits that attach to all of an individual’s behavior, is mistaken in its 
conception and bound to fail” (11, p. 63). Traits being specific, the trait 
method of measurement can only be useful for specific predictions. 

Investigators of personality have had higher ambitions, and rightly so, 
since the clinician, the educator, and others dealing with people must pre- 
dict their behavior in the wide range of life situations they daily encounter. 
A conception stemming from the work of Freud offers promise of accurate
and useful analysis of personality. This point of view is referred to under
a variety of names: “psychodynamics,” “study of the personality as a whole,”
“depth psychology,” and so on. It is complex, and cannot be adequately
treated here. Probably the most adequate account for the person otherwise
unacquainted with clinical literature is the summary by Mowrer and
Kluckhohn (11, pp. 69-135).

The dynamic approach insists that personality is more than a collection 
of specific conditionings to situations. Instead, it is believed that the many
specific behaviors exhibit an underlying unity. This unity is thought of in
terms of “needs,” “complexes,” and other emotional forces. What the
person does at any time can theoretically be predicted from his needs and a
knowledge of the forces in the field where he acts. In one situation he may
be dependent, to escape criticism from a superior; in another, he may be
dominant, to avoid exposing his insecurity to a subordinate; in a third, he
may be cooperative and non-assertive, because he feels secure with that
group of companions. The dynamic approach assumes that inconsistencies
in behavior have consistent causes, whereas the trait approach must
consider inconsistencies as errors.

The dynamic approach is frankly qualitative, although scores are often
employed in an intermediate stage of interpretation. The dynamic approach
assumes that personality is constantly developing, and seeks to predict how
it will change, rather than to assume that future behavior will be basically
similar to behavior today. The approach further postulates that one trait can
only be understood in relation to all other traits, and looks toward a synthesis
of all types of evidence. Since both inference and synthesis are necessary in
going from evidence of behavior to a description of personality dynamics,
the method is subjective. The quality of the results depends upon the
ability and effort of the analyst to a degree not found in other tests. This
is a tremendous disadvantage, since even if the method is good, it is good
only in the hands of a user with “clinical insight.” And who is to judge which
psychologists possess this intuition, and on what days it is functioning at
its best? While the tests used in dynamic analysis are frequently
standardized, the interpretation process is relatively uncontrolled. Different workers
trained in the same theory do reach similar conclusions on most cases, but
the process is definitely open to subjective errors.

The process of dynamic analysis begins with some type of evidence
about the person’s behavior. The evidence may be from a clinical interview,
from a self-report test (cf. pp. 323 ff.), from personal products and
documents, from observation in daily activities, or from specially planned test
observations and projective tests. The data are related by the interpreter to
everything else known about the subject, and an interpretation is made.
The interpretative description is considered adequate when it is internally
consistent and not contradicted by any of the data available, and when it
conforms to the analyst’s theory of personality. A description is
sometimes limited to a particular segment of behavior, such as how the
subject reasons or how he will function on a particular job.
Sometimes it tries to describe him completely, as he is now. The most ambitious
and most widely useful analysis not only considers the present but seeks to
explain how the personality developed, from infancy onward. Interpretations
are of course treated as tentative, except where a definite decision must be 
made immediately. The accuracy of the analysis is reviewed where possible 
in the light of subsequent evidence. 

Tests usable in making a clinical evaluation of personality range from 
the commonest measures of intellect and psychomotor functions to highly 
complex observational tests of ability to work with others. Observation in 
connection with tests of ability has been discussed at length in earlier 
chapters. In the following sections, the reader will be introduced to tests 
of concept formation, reaction to frustration, and social behavior. These
are representative of the situational tests now in use. In Chapter 20, projec- 
tive tests will be discussed. 

Tests of Concept Formation 

Intellectual processes are commonly thought of as abilities, and in most 
mental tests the subject does try to do as well as he can. But mental per- 
formance has an important qualitative aspect, since people of similar ability 
attack problems in entirely different fashions. The way a person uses his
intelligence, given a problem that permits varied attack, may be more
significant clinically than a precise IQ. A few tests have been designed to identify
the typical ways in which a person uses his mind. These tests have a close
relation to tests of ability, and might well have been considered in Part II
of the book; but their use is almost exclusively in connection with the clinical
study of personality.

The most prominent tests of this type are the concept formation tests
of Vigotsky, Hanfmann-Kasanin, and Goldstein-Scheerer-Weigl.³ The
Hanfmann-Kasanin⁴ may be taken as an example. Twenty-two wooden blocks
of several colors, shapes, heights, and sizes are placed on the table before
the subject. On the hidden underside of each block is printed a nonsense
syllable. The syllable defines a type of block (all mur blocks are small and
tall). The examiner tells the subject that the blocks are of four different
kinds, one of which is named mur, and that he is to find the four kinds and
sort the blocks. The subject is then free to proceed as he likes to discover the
classification, except that he may not invert the blocks to see the name. After
each trial sorting, the examiner points out one mistake and requires the
task to be repeated. This goes on until the problem is solved and the
principle of classification is stated. Observation shows whether the subject uses
a logical hypothesis (“perhaps all mur are triangles”), an arbitrary
hypothesis, or pure guessing. The degree to which he can form an abstract concept
from experience on successive trials is noted. One can observe ability to
profit from a correction, ability to discard a false set and take on a new one,
bizarre procedures and verbalizations, and so on. The test yields a numerical
score, but more significant is the character of insight and conceptual
thinking displayed. The test is based on the theory that schizophrenics are unable
to think abstractly but must respond to each object in the environment as a
separate individual thing. Preliminary results show marked differences
between schizophrenics, normals, and cases of organic damage (6). Tests of
this nature have not been employed as yet to assess the thinking habits of
persons of normal mentality.

10. Which tests of the Stanford-Binet and Wechsler series are particularly suited
to making observation of typical mental processes? 

11. Are any personality characteristics other than intellectual habits likely to be 
brought to light in the Hanfmann-Kasanin test?

Reaction to Frustration; Stress Situations

Two techniques of studying reaction to frustration or emotional stress
have been described earlier in the chapter: the Observational Stress Test of
the Air Force and the measurement of change in test score before and after
frustration. Both of these methods permit either quantitative or qualitative
scoring. Several other methods for studying stress have been devised, this
problem having attracted widespread attention for a number of reasons.
Most of the significant problems in the field of personality relate to emotion,
and one of the surest ways of inducing emotion is to frustrate the subject.
Differences in emotional behavior are highly important in clinical diagnosis.
And in practical selection, one of the important questions when filling
responsible positions is: How can this candidate cope with pressure,
criticism, and obstacles?

³ Publisher: Psych. Corp.

⁴ Publisher: C. H. Stoelting Co.

Frustration methods have produced considerable understanding of
development of aggression and other emotional responses in young children
(2). A common situational test is to permit the child to play with moderately 
attractive toys for a period of time. He is then given a brief opportunity to 
play with a set of “magnificent” toys, far superior to the first set. When these 
are removed and placed behind a barrier, behavior of different children 
varies widely. Some sulk, some claw at the barrier, some peacefully re- 
turn to the former toys. In playing with those toys, variation from the earlier 
play can be seen. Regression to a more immature form of play is especially 
common. Lerner has applied this method and games involving obstacles, to 
make clinical analysis of personalities of young children (14). 

Only three controlled attempts to use stress methods in selection have 
been reported to date: the Air Force tests, a test given to prospective police 
officers (4), and the assessment program of the OSS (16). Presumably the
method will be given further practical trial in civilian situations. 

12. What practical difficulties may be anticipated if one introduces frustration 
tests like that of the Air Force into an industrial selection program? 

13. Can one assume that the same frustrating situation is equally frustrating to all 
subjects? 

14. Justify Freeman’s statement that investigation of stress tests for possible indus- 
trial use must be done in real employment situations, since laboratory trials of 
the method are not comparable to the selection situation. 

Tests of Social Behavior 

If one had information about the intellectual processes, the emotional 
reactions, and the social behavior of a person, he would know about the 
most significant areas of personality. It is the third of these factors that is 
tapped by tests of leadership, cooperation, reaction to authority, and so on. 
Tests of interpersonal behaviors have become increasingly significant as
psychology has extended its interest to the psychology of the group. In addi- 
tion to the research interests of social psychology, tests of social behavior 
have major importance for selecting leaders in all fields. Military psychology 
has pioneered in these techniques. 

A device for studying social behavior in perhaps its least complex form
is a German test for leader selection, developed before World War II. A
special apparatus is used, consisting of two pairs of shears, linked by rods so
that they must move in unison. While one shear is opening, the other is
closing. Each subject operates one pair of shears, cutting a series of
increasingly complex patterns from a sheet of paper. The shears are so
arranged that if one man goes directly and forcefully at his task, the shears
of the other man move in a rhythm which makes accurate cutting almost
impossible. By means of observation, automatic recording, and inspection of
the product, the tester looks for evidence of initiative, dominance, and
cooperation which is used with other data in assessing workers or soldiers (13).
More complex are the OSS leadership tests, based on German and British
methods. In one, a group of candidates for the intelligence service are
directed to move a heavy eight-foot log, and themselves, over two walls ten
feet high, eight feet apart and separated by an imaginary bottomless chasm.
Observers note which men take initiative and leadership, how they direct
others, how they accept directions, and other characteristics (16, 1).

The OSS program, of which this test is representative, is the most
thoroughgoing attempt yet made to assess personality for practical purposes (as
distinguished from clinical or research work). The problem was important, 
since men selected would take responsibility for security of vital information, 
for the safety of units of the underground behind enemy lines, or similar 
duties. Candidates for assignment were drawn from the military services and 
from civilian positions. They were tested in small groups, each group under- 
going testing for three days. Testing included ability measures, a con- 
tinuous observation by the psychological staff, interviews, group discussions, 
and situational tests. At the end of the testing, each man was judged on the
basis of his total personality, or as much of it as could be brought to the 
surface by highly skilled observers using test situations as varied as they 
could invent. 

One test likely to have widespread importance is the psychodrama. This
method, which is also being applied to therapy, requires the subject to as- 
sume a role and play out a scene before an audience. In OSS testing, each 
man was assigned a role which might shed light on some facet of character 
about which the interviewers and observers felt they needed further data 
(18). A few of the themes used as a basis for psychodramas were: 

Personal criticism — 
failure to send liquor to cocktail party 
making a poorly protected bank loan 
Interpersonal conflict — 
competition in running dance halls 
differences of opinion on location of a hospital 
Authority-subordination — 
sergeant reporting bread riot to a major 

motion-picture producer securing permission from censor to show a 
picture 

The two or more men are given their respective assignments, with a sketch
of the situation, but the dialog and outcome are left to develop in the
discussion. [...] on the technique are lacking, [...] that the man reveals
much [...] of his insight into others.

Perhaps the high point of ingenuity in situation testing to date has been
the OSS construction test (16, pp. 102 ff.). This device is thought of as a
leadership test. The subject is assigned to build a five-foot cube with a sort
of super-Tinkertoy. Poles and spools must be fitted together, and since the
test is too large to be managed by one man, two helpers are assigned. After
giving directions, the tester ostentatiously clicks his stop watch and retreats.
What the subject does not know is that his helpers are highly trained
“stumblebums.” Kippy is negative, indolent, a drawback. Buster is an eager
beaver, ready to do all manner of things, mostly wrong, and also primed to
needle the candidate with personal criticism. This is reported as a typical
dialog (1, p. 95):⁵

“Candidate: Well, let’s get going. 

“Buster: What is it you want done, exactly? What do I do first? 

“Candidate: Well, first put some corners together — let’s see, make eight of these
corners and be sure you pin them like this one.

“Buster: You mean we both make eight corners or just one of us?

“Candidate: You each make four of these, and hurry. 

“Kippy: Whatcha in, the Navy? You look like one of them curly-headed Navy
boys all the girls are after.

“Candidate: Er, no. I’m not in anything.

“Kippy: Just a draft dodger, eh? 

“Candidate: Let’s have less talk and more work. You build a square over here
and you build one over there.

“Kippy: Who are you talking to — him or me? Why don’t you give us a number
or something — call one of us number one and the other number two?

“Candidate: I’m sorry. What’s your name? 

“Buster: Mine’s Buster and his is Kippy. What’s yours? 

“Candidate: You can call me Slim. 

“Buster: Not with that shining head of yours. What do they call you, Baldy or 
Curly? Did you ever think of wearing a toupee? 

“Slim: Come on, get to work. 

“Kippy: He’s sensitive about being bald. 

“Slim: Just let’s get this thing finished. We haven’t much more time. Hey, there, 
you, be careful. You knocked that pole out deliberately. 

“Kippy: Who me? Now listen to me, you — , if this — thing had been built right
from the beginning, the poles wouldn’t come out. For — , they send a boy out here
to do a man’s job. . . .”

Kippy and Buster are psychologists, and are in a position to make an 
excellent report on the man’s reaction. 

VALIDITY OF SITUATIONAL TESTS 

The validity of situational tests is in some respects not open to question. 
The man is placed in a standard situation, and how he reacts is unimpeachable
evidence of honesty, frustrability, thinking methods, and so on at that
time, in that situation. The difficulty lies in inferring how he will act on other
occasions and in other situations.

⁵ Reprinted from the March, 1946, issue of Fortune Magazine by special permission of

Systematic studies of the sampling reliability of situational tests are few. 
The evidence to date indicates that repeated measures on different occasions
are highly correlated, provided that the tests on each occasion are fairly 
long. But there is little evidence on the reliability of frustration tests, which 
might be susceptible to the day-to-day variation in irritability of the subject. 

The first assumption needed to compare individuals is that the stimulus 
situation is constant. But failing on a puzzle means more to some men than 
others, and produces unequal frustration. Some people are wounded by
failure, but others are unconcerned about puzzle solving and are unmoved by
failure on such a task. In stress situations, threat of failure may well be great 
for a man to whom getting into the Air Force is very important; he may show 
more emotionality than another man who doesn’t care much. In combat, the 
qualities that made for a poor score in the Observational Stress Test may 
lead to superior devotion to duty, whereas the man who didn’t care may lack 
motivation in the field. If qualitative evaluation is made, it is unnecessary to
assume that the situation was identical for all persons. Instead, one may note 
that the subject displayed great emotion, which means either low tolerance 
of frustration or strong motivation to do well in this test. The advantage of 
qualitative interpretation is that every behavior can be studied in its context,
which quantitative methods do not readily permit. 

It is then necessary to make inferences from the observed behavior, which 
involves many sources of error. If the test is objective and quantitative, one
has the burden of demonstrating that the test behavior is related to some 
external situation. If the procedure is qualitative, one must find means of 
minimizing subjective errors. The OSS used numerous interpreters, who 
agreed on the final assessment in conference, after all had become acquainted 
with the candidate. This is a costly method, but it may be essential to reach 
sound judgments. 

The fact that assessors observing candidates in different situations were 
able to agree on the final assessment suggests that the OSS method must be 
getting at some truth about personality. Attempts to validate the method 
were not successful, on the whole. It was virtually impossible to obtain a 
dependable criterion for comparing men, since the assessees were assigned 
to different theaters of war, to perform different duties under different su- 
periors. Ratings made under those conditions could scarcely be predicted 
even with the most superb psychological insight into each man. The correla- 
tions of assessment ratings with criterion ratings mean very little. One repre- 
sentative figure is the correlation of .39 between assessment ratings from the 
three-day testing and ratings made in the field, for men whose job overseas 
was the same one for which they were assessed (16, p. 427). A more informal
validation was made by comparing sketches written during assessment with
descriptions of how the men performed in the field, for 36 cases. Staff members
judged that in 24 of the cases, diagnosis and prognosis were essentially
correct, and in only 5 cases was either essentially incorrect (16, p. 437). This
type of judgment is too much open to bias to be considered final evidence of 
validity. 

At this time, it is impossible to make a general evaluation of the validity of 
situational tests. So long as they are treated only as objective measures of 
limited traits, few questions arise. Evidence is quite inadequate to support 
any contention for or against their validity as measures of the total person- 
ality. As predictors, they seem to have promise, according to military experi- 
ence. Presumably, when analysts can study the personality with many tests 
and arrive at a coherent and consistent picture of the individual, the indi- 
vidual tests must be sound. But there is no criterion to show whether a test 
has correctly laid open the inner personality structure of the subject. And 
even within a sound and useful test, such as the Hanfmann-Kasanin, there is
little evidence to show which of the many interpretations the test leads to are 
sound, or to show which types of evidence and inference from the test are 
untrustworthy. 

We must necessarily await further research before generalizing about observations
in test situations. Situational and projective tests may be the only
truly valid testing approach to personality. But they are too recent to be
validated, since one must learn how to use a test before he can prove that 
his method of using it is valid. 

15. If Corey’s experimental test of cheating in a psychology class were repeated 
with the same students in a history class, what factors might cause different 
students to be identified as cheaters? 

16. A personnel manager wishes to judge six college seniors as prospective branch 
managers. He brings the group together and leads them to discuss problems 
related to the position, such as current labor relations. He lets the men do the 
talking, merely keeping the conversation going. After the conference, he ex- 
amines a recording made by a concealed microphone to determine how each 
man participated. What assumptions is he making if he hires the man who 
contributed most to the solution of the problems discussed?

17. If two psychologists, observing the situation testing of a subject, reach differ- 
ent interpretations regarding his personality, how could one determine which 
opinions are erroneous? 

18. If one desires to institute an OSS-type program for the selection of personnel, 
stressing situational tests and qualitative evaluation of the total personality, 
how should the psychologists to conduct the program be selected?


CHAPTER 20


Projective Techniques 


The preceding chapter ended on a note of uncertainty: that situational tests
leading to qualitative, dynamic analysis of personality are a promising technique
but must undergo extensive development and refinement. Projective
techniques, being an extension and higher development of the same method,
present similar problems. The projective test uses a stimulus even more
unstructured than the situational test: if possible, one so novel that the subject
can bring to it no specific knowledge of how to respond. White provides an
apt summary of the strength and the weakness of the method (21, p. 215):

“Since little external aid is provided from conventional patterns he [the subject] 
is all but obliged to give expression to the most readily available forces within 
himself. It is further characteristic of projective methods that the subject does not 
know what inferences the experimenter intends to make; his attention is focussed 
on the play or task in hand, and it is well-nigh impossible for him to guess at its 
more remote psychological meaning. Favorable conditions are thus created for
unselfconscious revelation from the hidden regions of personality.

“From the nature of the material sought and from the unavoidable indirectness
of the seeking it is obvious that the interpretation of imaginative productions will
offer great scientific difficulties. Even a short segment of imaginative behavior
contains items, patterns, and trends in bewildering variety; intuition can seize
upon clues, but it is no easy matter to establish reliability, observer agreement,
and validity. This atmosphere of guesswork is highly uncongenial to the psychologist
who prefers his science in relatively finished form. To a certain extent, however,
it arises from the novelty of the problems and the recency of the attack
upon them; pathways to a greater certainty are . . . already beginning to show
themselves.”¹

Projective tests come nearer to grasping “the whole person” at once than
any other testing technique. Even the complex situational tests must focus
on segments of the personality, such as leadership behavior or response to
stress, and many such tests must be combined to arrive at a whole analysis.
A single projective test, however, will often yield a tentative portrait of the
entire personality. Self-report tests were characterized earlier as a standardized
clinical interview; in many ways, projective tests resemble a standardization
of the dream interpretation that Freud introduced. The two most
widely used projective tests, the Rorschach and the Thematic Apperception
Test, will be described here, together with a brief summary of other techniques.

¹ J. McV. Hunt (ed.), Personality and the behavior disorders. Copyright, 1944, The
Ronald Press Company.


THE RORSCHACH TEST 
History 

The Rorschach inkblot test is the invention of Hermann Rorschach, a 
Swiss psychiatrist. As one of many studies of mental patients, he designed 
what he called “an experimental investigation in form-perception.” This led 
first to the finding that schizophrenics, manics, hysterics, and other clinical 
groups had different characteristic ways of perceiving. He also learned to 
analyze these perceptual characteristics as an indication of the personality 
make-up of subjects, normal and abnormal. He published his Psychodiagnostics,
giving the complete outline of his procedure and theory, in 1921,
shortly before his death at the age of 37. In this, together with a posthu- 
mously published extension of his hypotheses, he outlined the test and its 
interpretation substantially as it is used today. 

Marked interest in Rorschach work in the United States is usually dated
from the appearance in 1937 of a monograph by Beck, based on his trials of
the technique in a Boston hospital. The other major leader in America has
been Klopfer, who elaborated Rorschach’s original method. The Rorschach
procedure was tested with many clinical groups prior to the recent war.
During the war it was used experimentally and practically by all armed
services, especially in connection with neuropsychiatric problems. Success in
these trials has stimulated widespread attempts to apply the procedure to 
children and college students, industrial employees including executive per- 
sonnel, and all types of clinical groups. Recent methods of administering the 
Rorschach as a group test have increased the use of the test, though research 
on group methods is inadequate. 

Acceptance of the Rorschach test has been retarded by the “cult” atmosphere
said to surround it. In part this arises from its origin in clinical practice.
Clinicians who tried the test and were impressed by its frequent successes
in individual cases rarely had systematic research to support their
convictions. Traditional experimental psychologists and psychometric investigators
were not inclined to study the test, because one to two years of study
were considered necessary to learn the test, and because it was said to be
based on “insightful interpretation” rather than upon the objective scoring
conventional in American testing. Recent data (28) indicate that objective
analysis can be made of test responses, although many test users are not
willing to so limit themselves. Another stumbling block to use of the test was
the development of two systems of interpretation. Minor differences were
emphasized, and divisions among Rorschach users were established which
are only now dying out. A further problem in the acceptance of the Rorschach
technique is that a new vocabulary has been developed to discuss
the information it brings to light. The report of a Rorschach analysis is
likely to contain words like “inner life,” “introversive,” “experience balance,”
and “constriction,” terms which have little meaning to the person who
has not made a detailed study of the test or psychiatric theory. Such terms
are not conveniently definable, which makes the test harder to study than
most, although it will be recalled that even so conventional a word as “intelligence”
can be interpreted only in the light of much study of the test
used in its measurement.


Test Procedure 

The Rorschach test is an experiment, an attempt to see how the subject
responds to a standardized stimulus. It differs from other experiments in the
wide latitude of response open to the subject. The subject sits with his back
turned to the experimenter. After establishing rapport, the investigator gives
him a card, with approximately the following instructions: “People see all
sorts of things in these ink-blot pictures; now tell me what you see, what it
might be for you, what it makes you think of” (25, p. 32).

Fig. 90. An inkblot of the Rorschach type.

The same procedure is followed for each card, the subject being permitted 
to respond as long as he desires. There are ten cards, which vary from V, a 
fairly good black-and-white representation of a bat, to IX, a vague mass of 
irregularly blended green, orange, and pink. The inkblots were selected by 
Rorschach from experimental trial of hundreds of blots. Parallel sets of
cards have been prepared by Harrower and Behn. Figure 90 shows an inkblot
of the Rorschach type.

In addition to transcribing as completely as possible the remarks of the 
subject, the experimenter notes, without the subject’s awareness, how long 
it takes him to respond to each card, whether his average time per response
is high or low, the extent to which he turns the card, and other behaviors.

1. An unstructured situation is one a person may interpret as he sees fit. What
questions about the task must the person answer for himself in taking the
Rorschach test?

2. Considering the Rorschach test solely as a sample of problem-solving behavior 
in a test situation, what hypotheses about the subject could be based on such 
data as the following?

a. J. S. gives about twice as many responses as the average subject. 

b. H. G. takes an exceptionally long time per response. 

c. F. F. does not turn the black-and-white cards, but he repeatedly turns the 
colored cards, where he has difficulty finding possible answers. 

3. What information might be obtained from the way the subject expresses his
ideas and the language he uses? 

The Integrative Analysis 

Before considering how the responses are treated, it will be helpful to 
examine a specimen Rorschach report to see what is meant by the test’s claim
to describe the total personality. Although the test yields quantitative scores, 
the fullest outcome of the analysis is a description of the person, embracing 
his abilities, interests, motivation, and social attributes. Such a description, 
if complete and valid, would necessarily give a unique report for every 
person, corresponding to our knowledge that all people are really different.

Descriptions of cases may be brief or extensive, depending on their in- 
tended use. Each of the following paragraphs is a description of a college 
student, where the aim was a brief characterization (28, pp. 22-23) : 

“Immature, expansive girl. Unsystematic, rather careless, probably socially ori- 
ented and not much interested in studies. Moderate intelligence, warm, lively, but 
rather superficial.” 

“Constricted personality, emotional responsiveness deeply repressed. Very eager 
for approval and prestige, but fundamentally detached in human relationships 
and pedantic in approach to ideas. Conscientious, over-systematic, with rigid 
ideals and undue concentration on learned detail at the expense of independent 
or imaginative thought. Need for routine and exactness, with fear of freedom. 
Difficulty in meeting new situations or protracted situations of uncongenial nature.
External adjustment as student likely to be satisfactory, but with very stubborn
limitations.”

In contrast to the above, which is nearly the minimum that the integrative
approach to the Rorschach can provide, thorough analyses are often several
pages long. Sample interpretations appear in the books by Beck and Rorschach,
who include records of scholars, psychotics, feeble-minded persons,
and children (5, p. 37). Among the questions the test can answer, at least in
some records, are the following: What is this person’s characteristic attack on
problems? How great are his intellectual resources, including precision of
observation, ability to organize, and originality? How hard does he drive
himself; does he make the most of his ability? What emotional factors ac- 
count for any gap between ability and quality of production? Has he any 
special interests? Does he enjoy social intercourse and can he adapt to other 
people? Is he at ease internally? Does he habitually express his emotions 
impulsively, repress them rigidly, or use them to enrich social relations? 

4. For each of the above questions, what test or tests discussed in earlier chapters 
(if any) would give the same information? 

5. Which of the above questions would it ordinarily be important to answer in
each of the following situations? 

a. In guiding a college student having scholastic difficulty. 

b. In determining whether a disturbed person is under strain severe enough
to cause a breakdown.

c. In selecting a personnel manager.

Scoring Procedure 

The first step in summarizing a Rorschach performance is to “score” each 
response. The scorer classifies each answer in three ways: according to location,
determinant, and content. This scoring is guided by directions and
scoring samples such as those for the Stanford-Binet test. Hertz has found
that trained scorers, using the same system, agree on 93 percent of scorings 
on principal factors (17). Before a tester can score Rorschach responses 
dependably he must have studied the test and practiced analysis of records 
under supervision. For this reason, no attempt is made to teach the scoring 
procedure here. Just enough information is given to show how the test 
operates. 

The scoring of four responses will illustrate the procedure. Suppose that 
on Card X, which is a mixture of bright-colored forms, we have these re- 
sponses: 

1. A big splashy print design for a summer dress.

2. Enlarged photograph of a snowflake (refers to a large crab-shaped area).

3. Two little boys blowing bubbles. You just see them from the waist up.

4. Head of a rabbit.


The scoring² of these responses (based in part on supplementary information
obtained by inquiry) is:

Response   Location   Determinant   Content
1          W          CF+           Art, Cg
2          D          F-            N
3          D          M+            Hd
4          D          F+            Ad

² This discussion is based on the Beck system because of its greater simplicity and
objectivity.

Locations are scored according to the area of the blot used in the response.
Categories are W (whole), D (commonly perceived subdivisions), and Dd
(unusual details). “Dress design” referred to all of the blot, and is scored W.
The others are detail responses, scored D.

In scoring determinants, the scorer identifies the elements in the blot
which led the subject to see what he did. Careful supplementary questioning
is needed to make certain of this scoring. The principal determinants are
form, color, movement, and shading. Most responses are dependent upon the
shape of the blot (F). Responses which fit the blot are scored F+; those
which are arbitrary in form are scored F-. Responses involving color also
involve form in most cases. Where two or more determinants affect the response,
they are combined. FC is a response, such as “yellow lion,” in which
the form is crucial. In CF responses such as “dress design,” almost any shape
of blot would be equally appropriate to the answer. Movement is of course
not actually in the blot. A response seeing a human in movement indicates
that the subject has imaginatively added something to the static form of the
blot. Human or human-like action is scored M. A summary of all determinants
is given in Table 68.

Table 68. Determinants Scored in Rorschach Systems

Determinant                   Typical Response            Beck     Klopfer  Related to
                                                          Symbolᵃ  Symbolᵃ
Form
  Shape of blot               Cat face; totem pole        F        F        Awareness of reality
Movement
  Human seen in action        Men fighting                M        M        Poise, fantasy
  Animal in action            Bird flying                 -        FM       Less mature use of imagination
  Inanimate force             Explosion                   -        m        Inner conflict
Color
  Color of blot               Red butterfly, fire         C        C        Social-emotional response, impulsiveness
Shading
  Perspective                 Landscape                   V        FK       Feeling of inferiority
  Texture                     Fur rug                     Y        c        Tact, shyness, caution
  Hazy forms                  Cloud, smoke                Y        K        Anxiety
  Hazy quality differentiated X-ray; relief map           Y        k        Anxiety
  Achromatic color            Black cat; white sail       Y        C'       Depression or elation
Position
  Response to place on card   “Heart, because in center”  Po       Po       Irrationality

ᵃ Where two or more determinants affect the response, a combined symbol is used, e.g., CF, Fc, YF.

Content scoring uses such categories as humans (H), parts of humans 
(Hd), animals (A), parts of animals (Ad), objects (Ob), etc. “Snowflake”
was scored nature (N). “Dress design” was scored Art and Clothing (Cg).
An additional score, P, applies to certain popular responses which are often
given. Original responses are scored O.

Content should be analyzed for “thematic material,” in which the subject
is using an inkblot to tell a story drawn from his own subconscious. One can
note tentatively the possibility that a remark like the following reveals the 
subject’s attitude toward his life: 

(To Card VI.) “This is a cutworm. This is a section of a field and the cutworm is
going up the row charging out leaving nothing behind but ground and desolation.
Probably this would represent feelers, groping as it goes along. He’s finding it
pretty fruitless up here — this (behind him) has been better.” 

The scoring of responses is summarized by counting the number of times 
each symbol was used. 
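
The tallying step can be illustrated with a short sketch in Python. It is offered only as an illustration and not as part of any published scoring system: the dictionary layout and the function name `summarize` are assumptions made here, and the form-quality signs (+ and -) are dropped so that only the base symbols for the four Card X responses discussed above are counted.

```python
from collections import Counter

# The four Card X responses discussed in the text, reduced to base symbols.
# Form-quality signs are omitted (an assumption made to keep the sketch simple).
responses = [
    {"location": "W", "determinant": "CF", "content": ["Art", "Cg"]},  # dress design
    {"location": "D", "determinant": "F", "content": ["N"]},           # snowflake
    {"location": "D", "determinant": "M", "content": ["Hd"]},          # boys blowing bubbles
    {"location": "D", "determinant": "F", "content": ["Ad"]},          # head of a rabbit
]


def summarize(scored):
    """Count how often each symbol appears within each scoring category."""
    locations = Counter(r["location"] for r in scored)
    determinants = Counter(r["determinant"] for r in scored)
    contents = Counter(code for r in scored for code in r["content"])
    return locations, determinants, contents


locations, determinants, contents = summarize(responses)
print("Locations:   ", dict(locations))
print("Determinants:", dict(determinants))
print("Content:     ", dict(contents))
```

A full record would hold many more responses; the point is only that the summary is a frequency count kept separately for location, determinant, and content.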

6. Score the following responses to Card X as well as you can without benefit of
inquiry or access to the inkblot itself.

a. A collection of brightly colored insects, mounted. 

b. A cow jumping through the air. 

c. A rough piece of pink coral. 

d. A tiny horse. 

Meaning of Scores 

Interpretation of Rorschach scores is difficult to learn. While trait names
are associated with the various symbols, the full meaning of the test be- 
haviors is oversimplified by the trait names. Caution must also be exercised 
because the same score has different meaning in different records. Any one 
score must be related to the total pattern. For example, a person who is 
forcing to obtain a large number of responses is more likely to have F— 
responses than one who is less productive but has equal ability. 

As a problem-solving situation, the test reveals much about the intellect
and drive of the person. Form quality is dependent on self-criticism. F- responses
suggest lack of accuracy and are common in feeble-minded and
psychotic performances. An extremely high percentage of F+, on the other
hand, indicates very cautious self-criticism. The number and quality of responses
is a measure of intellectual output. Quality of output is shown in
precision of ideas, complexity of organization, and originality.

Location reveals the subject’s problem-solving attack. Emphasis on W
indicates an effort to solve the problem in a hard but complete way, and
is associated with intellectual ambition, interest in systematically organized
ideas and theoretical matters. The D approach, breaking a total problem into
more meaningful sections, is thought of as a “common-sense” attack, one
which must be used in practical affairs. Breaking the blot into tiny or unusual
areas must represent some special drive. Sometimes it is a demand for perfect
correspondence of blot with idea which can only be attained by matching
some tiny form with an idea. Persons using many tiny details are often, in
their daily lives, preoccupied by trivial and practically insignificant matters.

Hunter finds a correlation between Rorschach estimate and measured IQ
of .T Sf (22). In a group of young children varying widely in age, several
Rorschach scores correlate .60-.70 with Binet MA (11). But the Rorschach
is less a measure of “how much intelligence?” than of how it functions. It
may indicate that a person is not using his potential because of lack of drive
or because of neurotic clinging to “certainties” which inhibits creativeness.
It may indicate a good ability which breaks down under emotional stress. It
may reveal a need for achievement that produces impressive results by sheer
effort toward quantity rather than superior insight or ability to organize.
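
The ceiling of .89 mentioned in the footnote to Hunter's figure can be checked with a small computation. The sketch below assumes Pearson's contingency coefficient, C = sqrt(chi-square / (chi-square + N)), and a hypothetical table with five categories on each side and equal numbers of cases in each category. The five-by-five layout is an inference from the .89 ceiling, not something the text reports, and the helper names are invented for this illustration.

```python
import math


def contingency_coefficient(table):
    """Pearson's contingency coefficient for a table of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (chi2 + n))


def perfect_table(k, per_cell=10):
    """Hypothetical k-by-k table in which the two classifications always agree."""
    return [[per_cell if i == j else 0 for j in range(k)] for i in range(k)]


# Perfect correspondence in an assumed 5-by-5 table.
c_max = contingency_coefficient(perfect_table(5))
print(f"Largest attainable C with five categories: {c_max:.2f}")
```

With k equal categories on each side the largest attainable value is sqrt((k - 1) / k), which is why a contingency coefficient should be read against its own ceiling rather than against 1.00.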

Content is a sign, as Beck puts it, of the person’s “mental furniture.” Special
interests or preoccupations are significant.

Study of emotional make-up is based chiefly on determinants. A logical 
answer to “What do you see in the blot?” is one determined by form. One 
who uses form as a determinant similarly tends to base actions on reasoning 
or knowledge of facts. A person who acts on logic and logic alone is quite 
rigid. Modifications of a purely rational approach are seen in responses using
color, shading, or movement. 

In one of the most astonishing features of the inkblot test, response to 
color is found to be an index of affective response. The colored cards seem 
to be emotionally stimulating. The effect is that of introducing an irrelevant, 
irrational, but exciting factor into an already difficult intellectual assignment. 
This is not unlike having to make a life adjustment in which people are involved.
Some subjects ignore the colors; they would also be expected to disregard
the emotions of others, to be formal and controlled. “Color shock” is
frequent, in which the force of the color upsets the person’s ability to produce
good responses. Persons who are friendly and socially adaptive produce
many FC responses. CF responses are associated with emotional instability,
egocentricity, and excitability; pure C responses, with uncontrolled emotionality.

The shading responses are not fully understood. They seem to represent a
hesitant taking-everything-into-account which may be related to tact and 
caution in dealing with people. In other responses, shading seems to be 
associated with anxiety and depressive mood. 

Movement responses are considered an expression of the subject’s inner 
impulses. A high M score is associated with rich imaginativeness and fantasy.
Absence of M suggests inhibition of spontaneous ideas. The M score may 
therefore be considered as representing self-acceptance and internal security.

Rorschach characterized people by a ratio usually called the experience
type. This is the balance between use of movement and use of color. Those
with large amounts of movement are called introversive. This is not the same
as the neurotic introversion of self-report tests. The introversive person finds
much of his satisfaction in his “inner life” and is responsive to others. The
extratensive, or color-emphasizing, person is more responsive and more
oriented to stimulation from without. People who avoid
both M and C are referred to as constricted. These are not rigid types, since
the factors may be present in any proportion.

³ This is a “contingency coefficient”; in this problem, perfect correspondence would
yield a coefficient of .89 instead of 1.00.
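
The balance between movement and color described in this section can be sketched as a simple comparison. The example rests on assumptions that do not come from this text: color responses are weighted 0.5 for FC, 1.0 for CF, and 1.5 for pure C, a convention commonly attributed to Rorschach, and the cutoff `low` for "avoiding both M and C" is purely hypothetical. Since the factors may be present in any proportion, the discrete labels are a simplification for illustration only.

```python
def color_sum(fc, cf, c):
    """Weighted color sum: 0.5 per FC, 1.0 per CF, 1.5 per pure C (assumed weights)."""
    return 0.5 * fc + 1.0 * cf + 1.5 * c


def experience_type(m, fc, cf, c, low=1.0):
    """Label the balance of movement (M) against weighted color.

    `low` is a hypothetical cutoff below which both factors count as avoided.
    """
    color = color_sum(fc, cf, c)
    if m <= low and color <= low:
        return "constricted"
    if m > color:
        return "introversive"
    if color > m:
        return "extratensive"
    return "balanced"


# Hypothetical records: (M, FC, CF, C) counts.
for record in [(5, 1, 1, 0), (1, 2, 3, 1), (0, 1, 0, 0), (3, 2, 2, 0)]:
    print(record, "->", experience_type(*record))
```

In practice the ratio itself, rather than a label, would be carried into the interpretation of the total pattern.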

7. Why might a “perfect” accuracy score, 100 percent F+, indicate an undesirable
personality pattern for many situations?

8. Do the Rorschach scores related to quality of output reveal maximum ability 
or typical behavior? How does the Rorschach compare with the Binet test in
that respect? 

Scoring is objective. In addition to the determination of scores, however,
the trained scorer looks for many cues in the interrelation of responses. The 
analyst tries to observe how well the subject recovers from emotional shock, 
how he copes with blocking or failure, whether he is in conflict about sex. 
Following scoring, the tester makes an interpretation of the total pattern of 
the subject. Hertz makes this statement regarding the objectivity of the 
process ( 18, p. 538) : 

“It should be noted that the final analysis in the procedure of the interpretation 
in terms of other clinical and test data defies standardization, as Rorschach origi- 
nally contended. The information gleaned from the Rorschach material is pro- 
jected against family background, education, training, health history, past life, 
qualitative judgments of the examiner and of other people, and other clinical and 
test data. This is then interpreted in terms of the examiner’s experimental knowl- 
edge of the dynamics of human behavior. Final conclusions are made by inference 
and analogy depending upon the experience, ingenuity, the fertility of insight, 
and, not to be forgotten, the common sense of the examiner. Prolonged and extensive
experience is necessary, not only with human personality but with all kinds
of clinical problems. This last step by definition, therefore, is personal to the examiner
and subjective in view. It permits no norms, and it eludes all standardization.”

Despite the importance of the personal skill of the analyst, and despite 
differences in scoring systems, different users of the test can reach comparable
conclusions. This is best demonstrated in a study where one record
was interpreted by Hertz, Beck, and Klopfer. The three summaries agreed in
all significant respects (20). 

Group Procedures 

The application of the Rorschach method to groups takes two forms. The
Group Rorschach is administered by means of slides. It permits less adequate
inquiry than the individual test but otherwise is satisfactory for intelligent
adolescents and adults. Munroe has developed a short scoring procedure for
use in screening (28). This procedure yields a single numerical index of
adjustment.

A multiple-choice Rorschach has also been tried. Validation studies indicate
that the procedure is valueless in its present form.

Validity

Description 

The Rorschach, in attempting to describe the total individual, presents
again the problem of what to use as a criterion. One satisfactory method is
the comparison of a blind diagnosis, made from the test record alone, with a
report from one who knows the patient well. These blind analyses are often
striking in their close correspondence to facts about the person. Table 69
compares parts of a blind analysis and a case history, written independently,
describing a 16-year-old Negro girl (8). In another study, analyses were
made of 20 records of problem children. Judges were asked to match the
Rorschach reports with case abstracts, five pairs at a time. There were 84
percent correct matchings (26). Hertz lists 20 studies reporting generally
successful blind analyses of Rorschach records for various pathological cases
(18, p. 546).

Table 69. Partial Comparison of Blind Rorschach Analysis and Case History of a
Negro Girl (8)

Rorschach Interpretation                          Case History

Considerably above average in intelligence        IQ 109

Cultural opportunity meager                       Restricted environment

Into things which are viewed in a practical way   A great capacity for confusing fantasy with
by others she introduces new elements out of      reality. She did not think it was peculiar that
her fantasy. The danger here is that the          the dead should visit the living.
element she introduces may be not only
original but, to others at least, extraneous.

She is very critical, and her stubborn            Maintained a superior attitude on the ward.
insistence on seeing things her own way very      Usually sarcastic and snappy. Always very
likely brings her into conflict with persons in   definite and positive in her attitudes. Did not
positions of greater authority.                   hesitate to voice her opinions.

Insecurity and fantasy going in the direction     Screaming at night, continually afraid of men,
of phobia                                         dogs, automobiles.
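
The figure of 84 percent correct matchings is easier to weigh against the rate expected from guessing. The sketch below is a minimal illustration that assumes each judge pairs every report with exactly one abstract and, under pure guessing, picks among all possible pairings with equal likelihood; the function name is invented for this example, and nothing is assumed about how the original study scored its judges.

```python
from fractions import Fraction
from itertools import permutations


def chance_matching_rate(n_pairs):
    """Expected fraction of correct matches when n reports are paired
    one-to-one with n abstracts purely at random."""
    total_correct = 0
    total_arrangements = 0
    for arrangement in permutations(range(n_pairs)):
        # A match is correct when report i is paired with abstract i.
        total_correct += sum(1 for i, j in enumerate(arrangement) if i == j)
        total_arrangements += 1
    return Fraction(total_correct, total_arrangements * n_pairs)


rate = chance_matching_rate(5)
print(f"Chance rate with five pairs at a time: {float(rate):.0%}")
```

Under these assumptions a guessing judge averages one correct match per set of five, so the reported success rate stands well above the chance level.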

One of the few systematic comparisons of test results with behavior descriptions
for many cases deals with adolescent problem children at a camp.
The writers conclude (49, p. 93): “Although inconsistent in many details,
the Rorschach test in most cases gave a total personality picture not inconsistent
with . . . observations of behavior. . . . Certainly in this study, successful
evaluation of behavior tendencies could only be arrived at through
careful analysis of the total configuration. Single determinants and simple
relations in the Rorschach psychogram were of little prognostic value.”

There has been no systematic validation of the test, score by score or trait
by trait. Accumulated blind descriptions show that on the whole the test does
much better than chance; there is less evidence that each specific
determinant, such as texture, is correctly interpreted. Until this evidence is
provided, probably the test will lead to some invalid inferences from scores.

Clinical Diagnosis 

The test has been widely used to classify abnormal subjects. It is most
useful as part of a systematic diagnostic procedure, the test indications being
confirmed by other methods. Evidence of abnormality is to be found in many
specific signs in the test, as well as in the general quality and character of
thinking. Beck summarized studies of hundreds of schizophrenics in part as
follows, first in terms of scores, then in terms of the parallel traits (4,
pp. 106-107):

“F+ is low; color nuance tends in the direction of pure color, or C; Dd is high;
sequence, or order, tends toward the confused; DW (wholes suggested by a
minor part) is the most characteristic method of approach by a patient to his
problems. . . .

“They misinterpret their world. . . . His willingness to be inaccurate is
indifference to the world’s opinion of his percepts. It is a lack of
self-regard. Primitive, instinctual forces dominate schizophrenic
personalities; when they are sensitive to the world’s opinions, passions still
rock them. They pay attention to and see meaning in insignificant, irrelevant
material within their field of perception, material their healthier fellows
neglect. . . . They are not orderly or logical, by the world’s standards.”


For detecting neurosis, the “signs” of Miale and Harrower-Erickson have
been widely used. They found certain score patterns especially frequent in
neurotics, and infrequent among comparable normals. The signs are listed
in Table 70. Any subject showing five or more signs is likely to be neurotic
(13). Kelley has written at length of the characteristics associated with
organic defects, feeble-mindedness, convulsive states, depression, alcoholism,
and so on (25). The studies on which clinical classification is based are rarely
systematic investigations. Cumulative experience with cases of a given type
has led to generalizations, but it is rarely possible to find validity coefficients
or other evidence of the degree of confidence to be placed in a Rorschach
interpretation.


Table 70. Rorschach Signs of Neurosis (13)

                                                              Percentage of Subjects
                                                                Showing Each Sign
Sign                                                              Neurotics     Normals
                                                                  (N = 74)      (N = 385)

25 or fewer responses                                                97           62
Not over 1 M                                                         72           32
More FM than M                                                       86           57
Color shock: indication of emotion or blocking on colored cards      46            6
Shading shock: emotion or blocking on shaded cards                   59            5
Refusal to respond, on any card                                      65           12
Over 50% F responses                                                 57           19
Over 50% A and Ad responses                                          57           22
Not over 1 FC                                                        79           29

Benjamin and Ebaugh found agreement between Rorschach diagnosis and
clinical diagnosis to be between 85 and 98 percent, for 50 varied cases (6).
Rapaport made an unusual statistical study, in which small, carefully
diagnosed groups were compared on each Rorschach score. He found statistically
significant differences, generally in the direction predicted by Rorschach
theory. Although his study makes it undeniable that there are
correspondences between Rorschach performance and abnormalities, he does not
validate any mechanical system of signs for classification. He emphasizes that
clinical diagnosis by Rorschach must be based on a study of the dynamics of
test performance rather than on signs alone (33, p. 367).

Screening 

Like pencil-and-paper tests, the Rorschach has been used to detect
abnormal subjects in general populations. The test is fairly effective in
screening neurotics. A cutting score of five neurotic signs identified 80 percent
of Harrower’s neurotics, at a cost of 15 percent false positives (13). Ross,
however, found that the same signs segregated people of low socio-economic
status who were not neurotic (39). He considers the signs to be measures of
insecurity, not neurosis. Much remains to be done to clear away similar
fallacies in other Rorschach assumptions.
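
The practical meaning of a cutting score depends on how common neurosis is in the group screened. The sketch below applies the five-sign rule to one made-up record and then uses Bayes' rule with the reported 80 percent and 15 percent figures. The record, the function names, and the 10 percent prevalence are assumptions chosen only for illustration; they do not come from the studies cited.

```python
def flagged_by_signs(signs_present, cutting_score=5):
    """Return True when the number of neurotic signs reaches the cutting score."""
    return sum(signs_present) >= cutting_score

def share_of_flagged_who_are_neurotic(hit_rate, false_positive_rate, prevalence):
    """Fraction of flagged subjects who are actually neurotic (Bayes' rule)."""
    true_flags = hit_rate * prevalence
    false_flags = false_positive_rate * (1.0 - prevalence)
    return true_flags / (true_flags + false_flags)

# One hypothetical record: nine signs, 1 = present, 0 = absent.
record = [1, 1, 0, 1, 0, 0, 1, 1, 0]
is_flagged = flagged_by_signs(record)

# 0.80 and 0.15 are the figures reported above; the 10 percent prevalence
# is an assumed value chosen only for illustration.
share = share_of_flagged_who_are_neurotic(0.80, 0.15, prevalence=0.10)
```

Under that assumed prevalence, only about 37 percent of the subjects flagged would be neurotic, which is one reason test indications in screening need to be confirmed by other methods.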

Munroe applied her inspection technique to freshmen at Sarah Lawrence
College to identify possible problem cases. She rated 348 girls on a
four-point scale as to adjustment, after a quick study of their Group Rorschach
protocols. She placed 185 girls in the two “best” categories, and 163 she rated
as having moderate or severe problems. Ratings of emotional maladjustment
based on teachers’ observations and need for psychiatric services during two
years showed 147 of the “good” group and only 52 of the “poor” group to be
adequately adjusted (28).
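
These counts can be laid out as a fourfold table. The sketch below fills in the remaining cells by subtraction and computes the proportion of cases in which the Rorschach rating and the later criterion agree. The variable names are illustrative, and treating "adequately adjusted" as a two-way criterion is an assumption that simply follows the way the figures are reported here.

```python
# Counts reported for Munroe's study (28).
rated_good, rated_poor = 185, 163
adjusted_in_good, adjusted_in_poor = 147, 52

# Remaining cells of the fourfold table follow by subtraction.
maladjusted_in_good = rated_good - adjusted_in_good   # rated good, later had problems
maladjusted_in_poor = rated_poor - adjusted_in_poor   # rated poor, later had problems

total = rated_good + rated_poor
# Agreement: rated good and adjusted, or rated poor and maladjusted.
agreement = (adjusted_in_good + maladjusted_in_poor) / total
pct_adjusted_good = adjusted_in_good / rated_good
pct_adjusted_poor = adjusted_in_poor / rated_poor
```

On these figures about 79 percent of the "good" group and about 32 percent of the "poor" group were adequately adjusted, and rating and criterion agree in roughly 74 percent of the 348 cases.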

Prediction 

Use of the Rorschach for predicting academic and vocational success is 
just beginning. One line of evidence on its potentialities for vocational guid- 
ance comes from records of those who are successful in a line of work. There 
is no one personality most common in any one vocational group, according to 
studies of artists, pharmacists, and accountants (23, 35). In one study Roe 
did find a rather homogeneous “occupational personality” among paleontolo- 
gists (34). 

Evidently in most vocations people of very different natures can achieve
success. There are different ways of succeeding in a profession: a doctor
may be adept at building patients’ confidence, at making accurate diagnoses,
or at dexterously manipulating instruments. Such a point of view would
stress use of the Rorschach in guidance for description of the subject, rather
than in any quantitative application. There is no justification for use of the
method as a mechanical selection test, unless each record is clinically
evaluated. Even the clinical method is inadequately validated.

Munroe’s adjustment rating was successful in predicting college freshman 
grades, a corrected coefficient of .49 being obtained. A similar coefficient for 
the American Council test of general ability was only .39. Rorschach was far 
more successful in identifying probable failure than ACE. There is need for 
evidence of the predictive value of the Rorschach in more typical colleges 
than Sarah Lawrence, which has a select student body and uses “progressive” 
educational methods. 

Success of therapy can be predicted, according to several studies. For
example, the success of insulin treatment was predicted on the basis of
Rorschach records of schizophrenics before treatment. Of 47 cases where
improvement was predicted, 44 did improve. Of 13 where failure was predicted,
9 did not improve (32).
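
The insulin figures can be summarized as proportions of correct predictions. The short sketch below does so; the function and key names are illustrative only.

```python
def prediction_summary(pred_pos, pos_correct, pred_neg, neg_correct):
    """Summarize a two-way prediction table as proportions correct."""
    total = pred_pos + pred_neg
    return {
        "positive_predictions_correct": pos_correct / pred_pos,
        "negative_predictions_correct": neg_correct / pred_neg,
        "overall_correct": (pos_correct + neg_correct) / total,
    }

# 47 predicted to improve (44 did); 13 predicted to fail (9 did not improve).
insulin = prediction_summary(pred_pos=47, pos_correct=44, pred_neg=13, neg_correct=9)
```

Predictions of improvement were borne out far more often (about 94 percent) than predictions of failure (about 69 percent); taken together, 53 of the 60 predictions, or about 88 percent, were correct.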

9. How might people with different personalities achieve success in pharmacy 
in different ways? 

10. How might the Rorschach test be used in a college program, so long as its 
predictive validity is no higher than .50? 

11. Munroe found that some people with poor adjustment did superior college 
work. How can this be explained? 

Stability of Rorschach Performance

Adequate reliability coefficients on the Rorschach test are hard to obtain.
Split-half coefficients are technically unjustified. Retesting is usually of
uncertain meaning because of memory. Kelley and others overcame this
problem by testing 12 cases before and after mild electroshock, which wipes out
memory for the first test but does not change personality. They report that
“in general, in every test, the psychogram produced was essentially
unchanged . . . and the diagnosis was in exact accord with the clinical
impression” (24, p. 36). Some single scores, however, shifted considerably,
which emphasizes the importance of studying patterns of scores. Percentage
scores based on short records were notably undependable.

Troup obtained retest data for identical twins, with a six-month interval
between tests. Correlations of main scores were fairly high (W%, .82; M%,
.79). Judges were able to match the two tests for each person in 92 percent
of the cases. That Rorschach patterns are highly individual is shown by the
difficulty the judges had in trying to match the test of each subject with that
of his identical twin. Success was little better than chance (46).

Application of the Rorschach Method

The Rorschach is widely used clinically to understand the characteristics
of patients and problem cases. It yields an unusual amount of information in
a short time. It indicates quickly whether a criminal or a delinquent, for
example, should probably be examined further for psychotic trends or is
functioning normally. It supplements Binet results for poor learners by
suggesting whether low scores represent lack of mental power or inhibitions
which prevent use of that power. Physicians, social workers, teachers, and
counselors now frequently refer cases for Rorschach diagnosis.

As a prediction or screening device applied to normals, the test has excited 
much interest; but research on normals is scanty. Its peculiar claims should 
make it useful for identifying problem students and employees before overt 
trouble develops. It may lead to prevention of psychosis by identifying cases 
for treatment in a pre-psychotic stage. It may help in identifying the peculiar 
temperamental assets which make a productive genius, an inspiring leader, 
or an insightful counselor. If any of these prospects is borne out by adequate 
research, use of the test with normals will be highly significant. 

The test will never be a tool for use by every tester. Because the flexibility
of the test eludes those with superficial training, the test must be given by
one who has specialized in it. Business firms and schools have found it
desirable to turn over Rorschach testing to consulting psychologists with
adequate experience.

Among the sorts of research to which the test is adapted, special mention
should be made of developmental and intercultural comparisons. The
emergence of personality can be studied by administering the test to children of
different ages. The test has been used with some success on preschool
children (11). Hertz has reported on differences between early and late
adolescents (19). Causal factors may be studied by comparing records before and
after therapy or other events (5, pp. 336 ff.). The fact that the test is thought
to be equally valid in any culture has made it useful to cultural
anthropologists. Differences between societies in temperament, reflecting different
patterns of upbringing of children, are revealed objectively by the test. DuBois
took records of the Alorese, on a Pacific island, and had blind interpretations
made by Oberholzer in Switzerland (10). The results were in close
correspondence to the analysis of personality trends based on observation of the
islanders.


THE THEMATIC APPERCEPTION TEST 
Description 

In 1938, Murray and co-workers published probably the most elaborate
systematic study of personality to date, after studying a group of subjects by
many methods (30). Among the major tools developed was the Thematic
Apperception Test. The test requires the subject to interpret a picture by
telling a story about it: what is happening, what led up to the scene, and
what will be the outcome.

Like the Rorschach, the TAT is an unstructured experiment. The pictures
are chosen to allow varied interpretation. Apparently the responses are
dictated by the past experiences, conflicts, and wishes of the subject. Essentially
the person projects himself into the scene, identifying with a character just
as he vicariously takes the place of a movie star when he sees a photoplay.
Unlike the Rorschach, the major analysis is based on the content of
responses rather than formal aspects. The TAT, more than the Rorschach,
identifies the subject’s drives, conflicts, and attitudes toward self.

The TAT consists of 20 pictures, somewhat different pictures being used 
for men and women. The subject is led to believe that his imagination is 
being tested. Two one-hour sessions are used for obtaining stories. In addi- 
tion to the stories themselves, the observer notes behavior during the test: 
evidence of emotion, bizarre language, and habits of attack on problems. 

Analysis of Responses 

The performance of the subject is evaluated in both objective and
subjective ways. The general quality of thinking (originality, logic, and so on)
may be judged. The ability to maintain control under emotional stress is
revealed, since usually some cards shock the individual by picturing
problems similar to his own. Language and expression may show attempts to
appear intellectual, schizophrenic lack of self-criticism, or other significant
trends.

Murray scored responses in terms of “press” and “need.” A press is a force 
acting on the individual; a need is something he seeks. Every act reported in 
the story may inferentially be classified as resulting from a press or a need. 
The following story analyzed by Sanford illustrates the procedure and some 
of the categories used in scoring need (n) and press (p). The picture used 
shows a man clinging to a dangling rope. A 13-year-old girl gave the
following story (3, pp. 567-590):

“This bandit (n Acquisition and n Aggression) got put in jail (p Dominance, 
p Aggression, and p Blame), and he kept trying to get away (n Autonomy, and n 
Blamescape). They made harder rules and gave him less freedom and less com- 
fort (p Dominance, p Aggression, and p Retention). The monotony was making 
him mad (p Lack, p Affliction). He had rather be shot than live in this awful 
dungeon (p Noxiance, p Disorderly Surroundings, and n Autonomy, n Abase- 
ment). He was going to try to anyway (n Autonomy and n Inviolacy). He got a 
rope (n Acquisition) and started to climb up the wall of the prison (n
Autonomy). There was a big searchlight and when it was turned off him he started up
(n Seclusion and n Autonomy). Then it turned and they saw him — all the jailers 
were after him (p Knowledge, p Dominance, and p Aggression). He was crafty. 
He kept dodging and they couldn’t shoot him (n Seclusion and n Harmavoidance).
Finally they shot the rope (p Aggression) and he fell and was killed on the 
cement floor. (Bad outcome: Death).”* 

* By permission from Child behavior and development, by Roger G. Barker, Jacob S.
Kounin, and Herbert F. Wright. Copyrighted, 1943, by McGraw-Hill Book Co., Inc.

The final interpretation is no set of scores on traits, but a consideration of
the interaction of needs and of themes of individual stories. In the stories,
one looks for the hero with whom the subject identifies and tries to discover
the real-life counterparts of the other characters of the story. No
interpretation of a single story is dependable, but the story given above might well be
that of a girl trying to establish independence of a family she considers
autocratic. It is significant that she feels that if the hero (herself) tries to escape,
he can’t win.

Although the Murray method has been helpful in many studies, it is com- 
plex and can be used precisely only after training and experience. Most 
clinical workers now place emphasis on the study of content rather than 
on scores. It is possible to shorten the test by selecting just those pictures 
presenting conflicts suspected of having significance for a particular case. 

Other Thematic Picture Tests 

Standardization is less important in the TAT than in the Rorschach be- 
cause treatment of results is less objective. Many workers have completely 
abandoned Murray’s pictures, using other sets more suitable to their subjects.

Henry tested Indians with pictures of Indian life. He based his analysis on
the following aspects of performance: mental approach, creativity,
behavioral approach (ideas about peer, adult, school, and other daily relations),
family dynamics, inner adjustment and defense mechanisms, emotional
reactivity, and sexual adjustment (14).

It is possible to adapt the picture technique to testing of a specific attitude
or problem (anti-Semitism, family relations, etc.). The selection of pictures to
probe a particular area is a powerful addition to the method which is only
beginning to be explored.

12. Of the seven factors studied by Henry, which would also be described by the 
Rorschach test? 

13. Describe two possible stimulus pictures for studying each of the following 
problems. 

a. Attitudes of workers toward management and its practices. 

b. Feelings of teachers about problem children. 

c. Attitudes of college students toward their studies. 

Validity and Reliability 

Objective treatments of TAT performance are disappointingly unreliable.
Sanford and associates made a variety of such checks. Need scores assigned
by four judges for the same test records had an average correlation of only
.57. Correlations between scores obtained at separate times were low. Even
split-half coefficients were, at best, around .48 for need scores (41, pp.
262-263). Scores on needs expressed in fantasy were correlated with ratings of
needs indicated in overt behavior, carefully observed. But these correlations
were so small (average, .11) that need scores cannot be used to predict
behavior. If the test users are correct in saying that people often have needs
which they cannot express in their overt behavior, this evidence does not
prove the test to be worthless. But a test which purports to reveal
“subconscious” factors can be validated only by an equally “subconscious”
criterion.

Such a validation is often sought by comparing projective interpretations
with each other. Workers have repeatedly commented on the consistency of
the pictures of personality derived from different sources, but there have
been few attempts sufficiently well controlled to evaluate the degree of
dependability of any one test. Henry compared his picture-test records with
Rorschach, life history, and other data, concluding that in 451 comparisons
on specific descriptions there was essential agreement between TAT and at
least one other source in 83 percent. There was definite disagreement in only
2 percent (14, p. 71). Separate case analyses for eight girls were made on
TAT, Rorschach, life history, and a battery of pencil-and-paper emotional
and character tests. One judge was able to match all 32 descriptions in sets,
while other judges matched 18 and 15 correctly (14, p. 56). Some aspects of
the interpretation were more thoroughly confirmed than others. Harrison
obtained similar success for a point-by-point comparison of TAT with case
histories of patients. He claimed 82.5 percent confirmations. IQ guessed
from the test correlated .78 with Binet IQ. Attempts to classify patients by
psychiatric categories were correct in 77 percent of the cases (12). Neither
Henry nor Harrison used the Murray method of scoring.

A study by Murray and Stein attempted to predict military leadership 
from responses to selected pictures in a group test. The rank-correlation of 
test results with ratings of ROTC men by their superiors was .65 (31). Al- 
though no validated results are yet available, work by Henry on the person- 
ality of business executives (15) promises that the test will be useful on 
many practical prediction problems that have been impervious to other tests. 

OTHER PROJECTIVE TESTS 

While the Rorschach and TAT have the prestige associated with wide 
application, projective testing is a general technique, rather than a name 
given to a few tests. There are a large number of tests classified as pro- 
jective, a few of which are listed in Table 71. Few of these tests, however, 
appear likely to be standardized and used as routine testing procedures. The 
reason is that the projective technique can be applied to any subject matter, 
so that each investigator and clinician uses what fits most easily into his 
immediate needs. 

Nearly any situation in the person’s life permits sufficient variety of
interpretation to be used as a projective test. The analysis of personality trends
in the Binet and Wechsler tests is essentially a projective method. Systematic
observation of playground activities permits similar inferences about drives,
habits, and tensions. The only unique characteristic of projective tests,
so-called, is that the stimulus materials are designed to evoke revealing
responses. There is no limit to suitable material. Experiments with preschool
children have shown all these methods to have value (27):

Table 71. Representative Projective Techniques

Test and Publisher        Stimulus Material           Age Range           Special Features            References

Mosaic test               Colored geometric shapes    Adults              Studies quality and         9, 48
                          to make a pattern                               process in thinking

Picture-Frustration Test  Cartoons with unfinished    Children, adults    Studies reaction to         38
                          dialog                                          frustration

Rorschach (Grune and      Inkblots                    Age 2 to adult      Relatively objective.       5, 25, 37
Stratton)                                                                 Extensive research
                                                                          available

Sentence completion       Unfinished sentences        Children, adults                                36
(Psych. Corp.)

Spontaneous paintings     Art materials to be used    Preschool to adult                              1, 47
                          as subject wishes

Szondi (Grune and         Photographs of psychotics                                                   44
Stratton)

Thematic Apperception     Pictures                    Children, adults    Detects specific needs      2, 30, 33, 40,
Test (Harvard)                                                            and conflicts               41, 45

Verbal summator           Recorded vowel sounds       Adults                                          42
                          which resemble unclear
                          speech

Word association          Neutral and emotionally     Children, adults    Relatively objective        33
                          weighted words

World test (play          Toy objects, sufficient to  Children, adults                                7
technique)                build a city, farm, etc.


Finger painting. The child is given “colored mud” and is allowed to make 
designs or pictures, or just to enjoy manipulating it. 

Cold cream. The child is given a jar of cold cream, placed on a floor covered 
with oilcloth, and permitted to do anything with it he wishes. 

Sticks. A frustration test (see p. 427).

Balloons. The child plays with many balloons with the understanding that 
they may be broken. 

Toys. The child is given toys from which he may construct a home and 
people as he wishes. Dramatic play follows. 

The application of these methods to many children is shown in a series of 
films produced by Vassar College (43). The situations bring to the surface a 
variety of attitudes: fear of criticism, acceptance of own impulses, sup- 
pressed aggression, feelings of rejection by the parents, rebellion against re- 
striction, and so on. 

Projective methods have special strengths which make them essential in
modern research and applied psychology. They stress personality as an
interrelated whole, rather than as a random mixture of isolated traits. They permit
every person to have a different final analysis, corresponding to our
knowledge that each person is unique. They tap forces which underlie overt
behaviors and are otherwise not observable, and tendencies which will break
forth under future stress though they are not yet apparent.

The disadvantages of projective methods are marked. They are far from
fully objective. Their validity is probably less than perfect, and there is no
present way of knowing just where a given analysis is incorrect. Responses
are distorted if subjects know much about the test and how it is used.
Projective methods are time-consuming and cannot be used without special
training. No satisfactory method for objectively summarizing a group of
records has been developed. Even the greatest enthusiast for projective methods
would agree that they provide an excellent illustration of the principle that a
test, good in one situation, may not be satisfactory for another purpose.

14. How might individual differences in personality be revealed in each of the 
following? 

a. Playing poker. 

b. Cooking dinner for one’s family. 

c. Writing a book review. 

15. What advantages might be expected in each of the five projective tests for 
preschool children that the other tests would not have? 

16. Under what circumstances would a slightly valid self-report test be preferable 
to a highly valid projective test of emotional adjustment? 

EVALUATION OF PERSONALITY TESTING 

The many techniques for studying personality have now been presented, 
together with what evidence there is on their validity. Evaluation is now in 
order, although it must be recognized that any summation is risky in an area 
where so many new developments are emerging. In fact, the most certain 
statement that can be made about personality testing is that it is in a state of 
flux. That is in itself a great advance. In the 20 years between Woodworth’s 
first inventory and World War II, the progress in practical testing of person- 
ality was negligible. Useful tests of attitudes and interests were designed, but 
in the more general aspects of personality only ingenious research studies on 
a small scale relieved the scene. Testers generally were immersed in self- 
report tests which had not been validated, and which in some cases had been 
shown to be completely invalid. 

In the surge of interest and ingenuity in testing which broke through in
time to receive a practical trial during the war, enough fertile ideas appeared
to warrant great optimism. Not all the techniques which were labeled
“promising” in this book will prove to be sound, but somewhere in this territory
there must be gold. Tentative research suggests that the most promising
areas for further exploration are:

the strictly empirical self-report test (e.g., Strong Vocational Interest, 
Minnesota Multiphasic, Classification Inventory), 
the self-report descriptive test using forced choice (Kuder Preference), 
the situational test (Observational Stress Test, OSS leadership tests), 
and projective techniques (Rorschach, TAT). 

Field observations, situational tests “measuring” limited traits, observation
incidental to intelligence tests, and conventional adjustment questionnaires
will all continue to be used, but it is less probable that they will undergo
extensive improvement.

For the tester who must deal with personality now, rather than after years
of further research, it is important to compare the techniques as they now
stand. Under present conditions, all techniques have their uses. Self-report
tests are the only method that can be administered briefly, though the
better-designed personality questionnaires require about as much of the subject’s
time as the projective tests. Projective tests, however, require more time for
scoring, and usually must be given individually. Observation in the field is
rarely practical save for research; a total of two hours of observing per
subject is generally required to estimate even one trait accurately. Ratings are
convenient enough, but of dubious value unless raters are carefully trained
and have good opportunity to observe.

In mass processing for screening of employees or students, the choice lies
between empirically validated self-report tests and the Group Rorschach.
The former have demonstrated value, provided that the particular test
chosen has been validated in the specific situation in which it is being used.
The Group Rorschach must certainly be validated where it is applied;
Munroe’s single study in college prediction is favorable, but shows only that this
technique is worthy of trial by other personnel workers. Imperfect
techniques may be tolerated in selection if they improve the quality of the group
screened. They may be tolerated in psychiatric screening only if test
indications are carefully verified.

In guidance and clinical work, a test with low validity can do great
damage. Self-report tests can be trusted only as guides to interviewing, never as a
basis for decision. The only present methods which offer promise of
substantial validity are behavior observations, and of these, projective tests have
the advantage because they can be most easily administered. It can only be
concluded that there is at present no method of firmly established validity
capable of measuring or describing personality. It is probable that situational
tests and projective tests will be made valid and practical, but research must
demonstrate where they are valid and where they err.

A change in personality tests is their increasing length. Very brief
questionnaires had real value in military screening, with a draft population. But
the questionnaires which seem to offer the greatest individual validity
(Strong, Kuder, Multiphasic, Humm-Wadsworth, etc.) are extremely long,
running toward 500 items. Short situational tests have been reliable, but to
obtain a valid estimate of any moderately generalized trait it is often
necessary to combine six to ten tests using different stimuli. Projective tests usually
require around an hour to give, and the interpreter must work for a much
longer time to dig the full meaning from the responses. The capstone is
Murray’s OSS method. Here the investigators needed a dependable measure
of all aspects of personality and were willing to spend the time needed to
get it. The time for each man: three days and nights. Obviously,
measurement of personality is no simple matter.

BROAD ISSUES IN TESTING 

This book was organized by separating tests of ability from tests of typical 
behavior. That division served to permit a systematic attack on the mountain 
of tests to be studied, but it has long since become obscured by the numerous 
tests that crowd the border line. The Wechsler test of ability turns out to be 
a major clinical method for studying the personality of abnormals, and
perhaps of normals. The Rorschach test of personality proves to be especially
useful as an indication of intellectual functioning. Such contradictions doom 
conventional thinking about ability and personality as separate entities. In- 
telligence, instead of being an inborn fixed quality, is increasingly found to 
be a growing, changing process, molded by anxiety, emotion, security, and 
other feelings. While a theory of hereditary capacity is still tenable, no theory 
of operating intelligence can pretend to isolate it from emotional factors. 
As clinical and experimental studies capitalize on new techniques for study- 
ing how intellects function, the separate aspects of ability and behavior will 
increasingly merge into a concept of the person as a whole. 

Changing psychological theory and problems will influence testing. The 
industrial tester and the educational tester will still need tests of single 
variables for particular predictions. The clinician is moving rapidly toward 
tests of complex and interacting functions. The battle between those who 
wish their tests pure, quantitative, and precise and those who want their tests 
practical is still going on. Practical experience with tests of single isolated 
abilities and of specific quantitatively measured personality traits is unen- 
couraging to the single-trait emphasis. For general intelligence, motor be- 
havior, and personality, the tests which serve practically are usually those 
which resemble human functioning in life situations. And that functioning is 
invariably complex. It may be that further research will permit the factorially 
pure test and the single-trait test to fulfill practical needs of the clinician, 
educator, and industrial psychologist. That day seems likely to be remote. 

The conflict between objectivity and subjectivity is another matter. So far, 
subjective tests can do things no objective method can. But subjectivity im- 
plies error, and it would be regrettable to conclude that errors cannot be 
reduced. In any variable, subjective recognition of differences must precede 
objective measurement. The present emphasis on subjective analysis of per- 
sonality may well be a primitive stage, which can be largely objectified 
through further research. Objectivity must not be attained through discard- 
ing valid types of information. But neither may clinicians rest content with 
what Wechsler has called “inspired guesswork.” As the subjective interpreter 
can reduce his processes to objective form, he reduces the possibility of 
error, makes it easier to train additional interpreters, and increases the value 
of his procedures. The statistician and psychometrician has a similar 
responsibility, to broaden his methodology to attack the variables of greatest 
significance in analyzing human behavior. Murphy places the conflict in 
perspective as follows (29, p. 668): 

“The impression has got abroad that there is an antithesis between personality 
measurement and an approach in terms of structure. Yet measurement supports 
rather than negates the emphasis upon wholeness. When concerned with structure, 
we wish, of course, to observe structural relations long and carefully before 
we attempt to decide what is to be measured and in what way. How far we can 
go with measurement is always an empirical question. We are between Scylla and 
Charybdis; for the people who love to analyze and measure, who give us the 
beautiful factorial analyses of ‘persistence,’ ‘radicalism,’ etc., are seldom interested 
in the mode of articulation between traits — mostly the measurers get stuck in 
measuring — and the structuralists are too enamored of relationships to bother 
with precise measurements, however quantitative their statements may be 
implicitly.” 

All in all, psychological testing is an accomplishment its developers may 
well boast of. Errors of measurement have been reduced year by year, and 
the significance of tests has been increased, until today all facets of American 
society feel the impact of the testing movement. The school has perhaps been 
most radically affected. But industry, marriage, governmental policy, and 
character-building agencies have all been studied or directly improved by 
means of tests. Testers are daily making the lives of people better by guiding 
a man into a suitable lifework, by placing an adolescent under therapy which 
will avert mental disorder, or by detecting causes of a failure in school which 
could turn a child into a beaten individual. Methods are now available 
which, if used carefully by responsible testers, can unearth the talents in the 
population and can identify personality aberrations which would cause those 
talents to be wasted. Building on these techniques, we are in a position to 
capitalize as never before on the richness of human resources. 